Dataset quality

Hi, do you think the sound quality of these samples is good enough to train a Tacotron model? I am getting bad synthesized audio like these after 40,000 iterations.
Samples from training:



Synthesized samples (during evaluation):


Thanks for your help.

Your files are not accessible to me (and I presume others). There’s a message saying they’re blocked by the site owner.

However, even with access, I think it would be a challenging question to answer.

A significant factor will be the total quantity of audio you’ve got to train with (which you didn’t mention). A rather more intangible aspect is how consistent the audio is: you can have plenty of good-quality audio and yet, if it’s too varied and inconsistent, it will be a struggle to train a model. Of course, confirming “what’s consistent enough” is going to be nigh on impossible. Are your transcriptions accurate, or are there potentially errors?
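To put a number on the quantity, and get a very rough handle on consistency, something like the following might help. It is only a sketch: it assumes your clips are 16-bit PCM WAV files sitting in one folder, and `summarize_wavs` and the statistics it returns are names I made up for illustration, not part of any Tacotron codebase. The level spread is just the gap between the loudest and quietest clip by RMS, which is a crude proxy for consistency at best.

```python
import array
import math
import sys
import wave
from pathlib import Path


def summarize_wavs(folder):
    """Summarize 16-bit PCM WAV files in `folder` (hypothetical helper)."""
    durations = []        # seconds per clip
    sample_rates = set()  # distinct sample rates seen
    levels_db = []        # RMS level per clip, in dB relative to full scale

    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            if wav.getsampwidth() != 2:
                continue  # assumption: this sketch only handles 16-bit audio
            rate = wav.getframerate()
            n_frames = wav.getnframes()
            samples = array.array("h", wav.readframes(n_frames))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
        if not samples:
            continue  # skip empty clips

        durations.append(n_frames / rate)
        sample_rates.add(rate)
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        # Floor the RMS at 1 so digital silence does not hit log10(0).
        levels_db.append(20 * math.log10(max(rms, 1.0) / 32768.0))

    return {
        "files": len(durations),
        "total_hours": sum(durations) / 3600,
        "sample_rates": sorted(sample_rates),
        "level_spread_db": (max(levels_db) - min(levels_db)) if levels_db else 0.0,
    }
```

Calling `summarize_wavs("wavs")` (the folder name is a placeholder) returns a small dict. More than one entry in `sample_rates`, or a large `level_spread_db`, would be worth looking into before blaming the model.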

Have you tried training with standard datasets? If so, how did you get on with them? How does training on your data compare with training on those? You often can’t carry assumptions across datasets, but at least it would reassure you that you’ve got the basics working. You might also be able to see how artificially degrading the quality of a known dataset affects how well it trains, until you get something approaching your set. Those are just some areas to think about.
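As one way of doing the degrading, you could mix white noise into the clips of a known dataset at a chosen signal-to-noise ratio and retrain at a few different levels. The sketch below is just an illustration under my own assumptions: it works on a list of 16-bit PCM sample values you have already loaded, `add_white_noise` is a made-up name, and white Gaussian noise is only one kind of degradation (it won’t mimic reverb, clipping or compression artefacts).

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return 16-bit PCM samples with Gaussian noise added at roughly `snr_db`.

    Hypothetical helper: `samples` is a sequence of ints in the int16 range.
    """
    if not samples:
        return []
    rng = random.Random(seed)  # seeded so the degradation is repeatable

    # Average power of the clean signal.
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR (dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)

    noisy = []
    for s in samples:
        value = int(round(s + rng.gauss(0.0, sigma)))
        noisy.append(max(-32768, min(32767, value)))  # clip to the int16 range
    return noisy
```

Lower `snr_db` values mean more noise. Stepping it down until the known dataset starts producing output like yours would give a rough idea of how much the audio quality alone could be costing you.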