Custom voice - TTS not learning

I’ve trained on LJSpeech to confirm a working setup.

I’ve created a custom dataset of ~15K utterances. AnalyzeDataset looks good after I filtered out outliers (transcripts longer than 63 characters). CheckSpectrograms also checks out. My dataset is in LJSpeech format.
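
For reference, a minimal sketch of the kind of length filter I mean (assuming an LJSpeech-style pipe-separated metadata.csv; the filenames are placeholders):

```python
# Sketch: drop outlier utterances from an LJSpeech-style metadata.csv
# (pipe-separated: file_id|raw_text|normalized_text).
MAX_CHARS = 63  # cutoff chosen from the AnalyzeDataset length distribution

with open("metadata.csv", encoding="utf-8") as fin, \
     open("metadata_filtered.csv", "w", encoding="utf-8") as fout:
    for line in fin:
        text = line.rstrip("\n").split("|")[-1]  # normalized transcription column
        if len(text) <= MAX_CHARS:
            fout.write(line)
```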

I think the best way to summarize the performance is this: after 20K iterations, all 4 test files in test_audios/2XXXX are the same length as each other and as the first 4 test files in 1XXX. In contrast, LJSpeech quickly shows divergence in files and file lengths.

I’m a bit stumped after having carefully checked file formats/sampling rates, etc. I know my data isn’t completely clean, but I would have expected some result. FWIW I’m using subtitle data for the character Cartman from South Park.

Possible reasons:

  • Sampling rate mismatch between the model config and your data (a quick check is sketched below)
  • Audio processor parameters that do not match your dataset
  • Background noise or long silences in the dataset

You can also check https://github.com/mozilla/TTS/wiki/Dataset
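
A quick way to rule out the first two is to read the wav headers and compare them with the sample rate in your config. A minimal sketch (the directory name and the 22050 Hz value are only placeholders):

```python
# Sketch: flag wav files whose sample rate differs from the training config.
import wave
from pathlib import Path

EXPECTED_SR = 22050  # whatever your config's audio section specifies

mismatched = []
for wav_path in Path("wavs").glob("*.wav"):
    with wave.open(str(wav_path), "rb") as w:
        if w.getframerate() != EXPECTED_SR:
            mismatched.append((wav_path.name, w.getframerate()))

for name, sr in mismatched:
    print(f"{name}: {sr} Hz (expected {EXPECTED_SR})")
print(f"{len(mismatched)} mismatched files")
```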

I’m pretty confident the audio parameters are correct, though I’ll try again with the latest commits.

Background noise/silence is definitely an issue, but a relatively rare one, I think. This dataset is dialogue, so there are one-word turns, e.g. “No!”, as well as pauses in the middle of a turn, e.g. “Well… I guess so.”

I’m looking at improving the word-level alignments and manually cleaning. Any suggestions on dealing with this type of data (DVD audio/subtitle data) would be appreciated.
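
For the silences specifically, an energy-based trim may get most of the way there before manual cleaning. A minimal sketch assuming librosa and soundfile are installed (directory names and the top_db threshold are guesses to tune, not settled values):

```python
# Sketch: trim long leading/trailing silences from every wav in the dataset.
import librosa
import soundfile as sf
from pathlib import Path

out_dir = Path("wavs_trimmed")
out_dir.mkdir(exist_ok=True)

for wav_path in Path("wavs").glob("*.wav"):
    audio, sr = librosa.load(str(wav_path), sr=None)     # keep native sample rate
    trimmed, _ = librosa.effects.trim(audio, top_db=30)  # lower = more aggressive
    sf.write(str(out_dir / wav_path.name), trimmed, sr)
```

Mid-utterance pauses are harder; trimming only touches the edges, so pauses like the “Well… I guess so.” example stay intact.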

I trained on a custom dataset too, and my initial attempts were disheartening: I faced similar issues with it, even though I got good results with LJSpeech.

YMMV, but what I found, after doing clean-up along the lines of @erogol’s third suggestion, was that for quite a while during training it still gave terrible results (all the same kind of noise, as you seem to be observing), but then it quite rapidly hit a point where it dramatically improved: not into perfect speech, but something much more speech-like and distinct. I’m not at my computer and don’t recall the exact point, but I think it was around 20-30k iterations, and things started to get pretty decent by 100k. So maybe you’re not going quite far enough?

The other thing that helps give a bit more feedback (which you may already be doing) is using TensorBoard, since you can see how the results improve much more clearly than by just running the model and checking the test files manually.

One last point: I had to throw out a bunch of my audio because it was incorrectly clipped. I wonder whether, with subtitles, you might have cases where the text doesn’t actually match what was said (it’s quite common for subtitles to be shortened from the spoken words). It would be hard to check, but that could be throwing the model off.

Best of luck! Would be great to hear how you get on. Here are some samples trained with my dataset: https://m.soundcloud.com/user-726556259/sets/tts-demo-updated-may-2019

One more trick to find bad instances is to run your trained network on your training data and look at the instances with the highest loss values. These likely correspond to the worst instances in your dataset. Repeat this until you are satisfied with the quality of your worst instances.
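
The loop is roughly the following. This is only an outline of the idea, not the Mozilla TTS API: `load_model`, `iterate_training_set`, and `utterance_loss` are hypothetical stand-ins for whatever your training script already provides.

```python
# Sketch: score each training utterance with the trained model and list the
# highest-loss (likely worst) ones for manual inspection.
import torch

model = load_model("best_model.pth.tar")  # hypothetical checkpoint loader
model.eval()

scores = []
with torch.no_grad():
    for item in iterate_training_set("metadata.csv"):  # hypothetical iterator
        loss = utterance_loss(model, item)             # hypothetical per-item loss
        scores.append((float(loss), item["wav_file"]))

# Fix or drop the worst offenders first, then re-check.
for loss, wav_file in sorted(scores, reverse=True)[:50]:
    print(f"{loss:.4f}  {wav_file}")
```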


Great suggestions, thank you both. When I solve it, I’ll post back to this thread :smile:

I redid the forced alignment and put some conservative filters on the output. The light blue line in the screenshot below is my current run. Already it is further along (in sound quality) than any of my previous runs. Once I know this is working, I’ll link the code here.
[Screenshot: training curves, 2019-06-10]
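
By “conservative filters” I mean simple sanity checks on the aligner output, along these lines (the column names and thresholds below are illustrative, not the exact values I used):

```python
# Sketch: keep only aligned segments with a plausible duration and speaking rate.
import csv

MIN_DUR, MAX_DUR = 0.5, 10.0   # seconds
MIN_CPS, MAX_CPS = 5.0, 25.0   # characters per second

with open("alignments.csv", encoding="utf-8") as fin, \
     open("alignments_filtered.csv", "w", newline="", encoding="utf-8") as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        dur = float(row["end"]) - float(row["start"])
        cps = len(row["text"]) / dur if dur > 0 else float("inf")
        if MIN_DUR <= dur <= MAX_DUR and MIN_CPS <= cps <= MAX_CPS:
            writer.writerow(row)
```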


How long did each iteration take, and how long did it take to get to the tipping point (~30K iterations)?

Thanks!

Just responded in another thread: I’m doing a run on a 1080ti right now, and it’s done about 60K iterations in 15 hours, so roughly 0.9 seconds per iteration. The speech is somewhat intelligible but not great yet :slight_smile: I wouldn’t say there’s been a tipping point; phonemes/syllables have slowly emerged from the beginning of the test utterance and increased in length over time.

FWIW I’ve decided I have to manually clean my data, so I’ve built a web-based tool to facilitate that process. Here’s the blog post with links to GitHub and walk-through video: https://olney.ai/category/2019/06/18/manualalignment.html

Once I’ve cleaned it up, I’ll post back to this thread with the training results.


Hey @aolney, were you able to train your model? I’ve got the same problem (see Training - Custom voice doesn't train).

Still manually cleaning. It’s going to take a while…


I’ve only cleaned about a third of my data, but it’s definitely working with about 2,500 utterances. The major cleaning steps I’ve done:

  • Start/end times. There may be some silence at either end, but only about 10 ms on average.
  • Noise. Any kind of noise I coded and separated out: music, hum, crickets, etc.
  • Yelling/pleading/unusual vocalizations. If it wasn’t subjectively normal speech, I separated it out.
  • Splitting larger utterances. Any utterance with pauses where the prosody seemed reasonable for splitting, I split. I was aggressive about this, since my original length distribution had a long tail (a rough sketch of the audio side follows this list).
  • Fixing incorrect transcripts (where obvious).
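
For the splitting step, the audio side can be sketched with librosa’s silence-based splitting (thresholds are placeholders to tune per recording; the matching transcript still has to be split by hand or from the alignments):

```python
# Sketch: cut one long utterance at silent gaps and write out the pieces.
import librosa
import soundfile as sf

audio, sr = librosa.load("long_utterance.wav", sr=None)

# Non-silent intervals in samples; top_db controls how quiet a "pause" must be.
intervals = librosa.effects.split(audio, top_db=40)

for i, (start, end) in enumerate(intervals):
    if (end - start) / sr < 0.3:  # skip fragments shorter than 300 ms
        continue
    sf.write(f"long_utterance_part{i}.wav", audio[start:end], sr)
```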

I’m using the manual cleaning webpage I linked before :wink:

Up to now, I’ve been using a training regime where I’d repeatedly add 500 utterances and continue from the previous model run. However, I’m finding that a fresh run with all the utterances performs much better.


@aolney thanks for sharing this. It is a great post for people having the same problem. Would you mind if I linked your post from our TTS wiki?

Please feel free to link or copy/paste anything you feel would be helpful :thumbsup: