Stopnet loss got high and is not decreasing after dataset change

Hello everyone!

Since I’ve got limited hardware and I want to speed up training as much as I can, I first pretrain the TTS model on short utterances and then switch the dataset (same speaker) to one with longer utterances (and a smaller batch size).
The thing is, the stopnet loss got much worse and it doesn’t look like it is going to improve. I use a separate stopnet. Here is what the plots look like:

Is there any way to fix this? Can I somehow increase the LR of the stopnet, or freeze Tacotron’s weights for a while?
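
For example, would it make sense to give the stopnet its own param group with a higher LR? A rough sketch of what I have in mind (just an illustration, not the actual trainer code; I'm assuming the stopnet parameters can be picked out by the substring "stopnet" in their names, and the LR values are placeholders):

```python
import torch
from torch import nn, optim


def build_optimizer(model: nn.Module, base_lr: float = 1e-4, stopnet_lr: float = 1e-3):
    """Give every parameter whose name contains 'stopnet' its own, higher LR."""
    stopnet_params = [p for n, p in model.named_parameters() if "stopnet" in n]
    other_params = [p for n, p in model.named_parameters() if "stopnet" not in n]
    return optim.Adam([
        {"params": other_params, "lr": base_lr},
        {"params": stopnet_params, "lr": stopnet_lr},  # e.g. 10x the base LR
    ])
```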

Thank you!!

I’d check whether the longer samples have silence at the end and how long this offset is. There might be a difference between the short and long samples, and this affects the stopnet after you switch.
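
Something like this quick librosa check (just a sketch; the wav folder and the top_db threshold are placeholders you'd adjust to your recordings) would show how much trailing silence each file has:

```python
import glob

import librosa

# Print the trailing-silence duration of every wav file.
for path in sorted(glob.glob("wavs/*.wav")):
    y, sr = librosa.load(path, sr=None)
    # trim() returns the trimmed signal and the [start, end] sample indices
    # of the non-silent interval; everything after `end` is trailing silence.
    _, idx = librosa.effects.trim(y, top_db=40)
    print(f"{path}: {(len(y) - idx[1]) / sr:.3f} s of trailing silence")
```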

They have almost no silence at the end, indeed! I’ve checked all the recordings; they have only about 0.1–0.2 sec of silence at the end.
According to the attention plots, attention is also struggling. If I listen to the eval and test audios, they usually end with silence, often multiple seconds long.


How is the performance when testing it on novel sentences? Looking at the graphs, I think it is overfitting, but I might be wrong. The alignment graph also does not seem to be capturing a lot of information.

@georroussos
It depends. Some are good, some are bad. The thing is, I used multiple sentences per utterance in the dataset. The small dataset had mostly one-sentence utterances, while the new one has mostly multi-sentence utterances. It does impact the performance. Is there anything I can do about it?

All the test files are multi-sentence, but it skips the last sentences of some of them.

@erogol Might it help to train the stopnet NOT separately?

Okay, if the stopnet is the only problem, you can try not training it when you restore the model that was trained on the shorter utterances. Generally, silences are not very good for attention. If what you want is for the TTS to sound natural between sentences, pairs (two sentences at most per wav) are enough for it to pick up on silence modelling. In your position, my workflow would be to first evaluate the quality of the training set (sound and content, e.g. whether the transcriptions match what is spoken). After that, do what you did and train on the short sentences. Once that is robust enough, train on a dataset that has at most two sentences per wav file, where the silence between them is not longer than 0.100 secs. Also take care that all other silences (while the speaker is talking) are short.
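
If you need to normalise those silences yourself, a rough sketch with librosa/soundfile (folder names, top_db and the 0.1 s pad are just placeholders, not from your setup) could look like this:

```python
import glob
import os

import librosa
import numpy as np
import soundfile as sf

IN_DIR = "wavs"            # placeholder input folder
OUT_DIR = "wavs_trimmed"   # placeholder output folder
PAD_SEC = 0.1              # leave ~0.1 s of silence at the end

os.makedirs(OUT_DIR, exist_ok=True)

for path in sorted(glob.glob(os.path.join(IN_DIR, "*.wav"))):
    y, sr = librosa.load(path, sr=None)
    # Trim leading/trailing silence; long pauses *inside* a clip still need
    # to be checked separately (e.g. with a VAD or by hand).
    y_trimmed, _ = librosa.effects.trim(y, top_db=40)
    pad = np.zeros(int(PAD_SEC * sr), dtype=y_trimmed.dtype)
    sf.write(os.path.join(OUT_DIR, os.path.basename(path)),
             np.concatenate([y_trimmed, pad]), sr)
```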

@georroussos Thank you!
But how do I freeze the stopnet’s weights and train only the rest of the Tacotron?
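
Would something like this sketch be the right idea? (Again just an illustration, assuming the stopnet parameters can be selected by the "stopnet" substring in their names after restoring the checkpoint, and that the optimizer is rebuilt afterwards.)

```python
from torch import nn, optim


def freeze_stopnet(model: nn.Module, lr: float = 1e-4):
    """Freeze every parameter whose name contains 'stopnet' and rebuild the
    optimizer from the remaining (still trainable) parameters."""
    for name, param in model.named_parameters():
        if "stopnet" in name:
            param.requires_grad = False
    return optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
```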