Guessing that this is over-fitting, but want to confirm

GuyEP · March 18, 2021, 4:53am

I have a question about how you can tell in the graphs when you are in over-fit territory. I saw that the wiki notes:

Stop the training, if your model starts to overfit (validation loss increases as training loss stays the same or decreases). Sometimes, the attention module overfits as well without noticing from the loss values. It is observed when the attention alignment is misaligned at test examples but train and validation examples. If your final model does not work well at this stage, you can retrain the model with higher weight decay, larger dataset, or bigger dropout rate.

Which graph(s) should I be paying attention to for this?
Do the graph(s) suggest that adding more data to the training set would potentially help?
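
For reference, here is how I read the wiki's criterion (validation loss rising while training loss stays flat or falls), written as a rough check over logged loss values. This is only a sketch: `find_overfit_point`, `window`, `patience`, and `min_delta` are names I made up, not part of the TTS library, and the numbers in the usage lines are invented, not taken from my run.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over what is available."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def find_overfit_point(train_loss, val_loss, window=3, patience=3, min_delta=0.0):
    """Return the index of the best (smoothed) validation loss if the run
    looks over-fit after that point, otherwise None.

    "Looks over-fit" here means: at least `patience` evaluations have passed
    since the best validation loss, validation loss is now higher than that
    best value by more than `min_delta`, and training loss has not risen.
    All thresholds are hypothetical defaults, not recommended values.
    """
    if len(train_loss) != len(val_loss):
        raise ValueError("train_loss and val_loss must have the same length")
    if not val_loss:
        return None

    train = moving_average(train_loss, window)
    val = moving_average(val_loss, window)

    # Index of the lowest smoothed validation loss seen so far.
    best = min(range(len(val)), key=val.__getitem__)

    # Not enough evaluations after the minimum to judge a trend.
    if len(val) - 1 - best < patience:
        return None

    val_rose = val[-1] > val[best] + min_delta
    train_not_rising = train[-1] <= train[best] + min_delta
    return best if (val_rose and train_not_rising) else None


# Invented example values, one entry per evaluation step.
train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15]
val = [1.1, 0.9, 0.8, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0, 1.05]
best_step = find_overfit_point(train, val, window=1)
print("best checkpoint index:", best_step)
```

If this reading is right, the checkpoint nearest the returned index would be the one to keep, and anything trained after it is in over-fit territory.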

Here’s what I have:

[Image: training graphs, stiles-ddc-March-15-2021_03+28AM-547bfc4 #4]

Sample audio from select runs:
https://1drv.ms/u/s!AigXEMxuXyh0gelWEP4lb87JZIU9XQ?e=a1hEJn

More background on the model:
I have a pretty small training set right now, only about 200 clips, but the result is actually not as bad as you’d expect. I did transfer learning from the LJSpeech Tacotron DDC model. Some of my validation clips sound almost human, while others have a fair amount of slurring/mumbling on certain words.

The “smoothness” of the audio has gotten better the longer I’ve trained, but the slurring hasn’t. In other words, the utterances flow together better with less noise, but the pronunciation doesn’t improve, and the voice sounds more synthetic the more iterations I run. Meanwhile, the “TTS_EvalAudios/ValAudio” sample sounds more and more degraded with each iteration (I’m guessing this is caused by the increasing avg_loss?).

I am in the process of pruning audio clips that I think are causing this; there are a few situations in which the speaker elides/shortcuts a syllable, usually at the beginning or end of an utterance, and my hypothesis is that attention is learning to shortcut that phoneme, leading to the slurring. Either way, I assume pulling out clips so that the speech is more uniform can’t hurt; I’ve been through several prune-and-train iterations before and each time it’s produced better results.