Attention makes repetitions after long training (after converging successfully)

I’m training Tacotron 2 with the default config (with a few audio config tweaks) on one French speaker of the m-ailabs dataset (50h).
I noticed that after the 50K-iteration dip in loss (caused by the default gradual training schedule), while the loss kept going down and the alignment score kept going up, the alignment started producing repetitions. This happens mostly on short sentences, but I noticed artefacts on longer sentences as well.
I tried training for longer (150K iterations), hoping it would disappear, but no luck.

Has anyone encountered the same issue?


Hi Julian,
You mentioned using the m-ailabs dataset, but I couldn’t quite make out in your screenshots how long the sentences appear to be. What is your value for max_seq_len?

It might help if you could post your config.json

Whilst I haven’t seen repetitions in charts like the ones you show, I’ve definitely experienced cases where the output gets into a loop (either whole words or, more often, short snippets of sound repeating).

The config might suggest other causes, but I suspect you’ll have more trouble with this sort of thing if the sequence length is overly long, or if the distribution of sentence lengths you’re training with isn’t ideal for the range you’re using (I’m not aware how m-ailabs stacks up in this regard).
Kind regards,

Thanks for your quick response,

Yeah, I think the loops you experienced are the same thing I’m seeing. For the audio to loop, I think you have to have an attention plot like mine, since the same word in the input is repeated in the output. (By the way, it only happens on some of my sentences, not on others.)

Here is the link to my config.json :
As you can see, max_seq_len = 150 in my case. I first had problems with the attention not converging, so I turned it down a notch.

I also ran the AnalyzeDataset.ipynb notebook to get some more insight.

Here is the distribution of sentence lengths in my dataset.
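In case anyone wants to reproduce this without the notebook, here’s a rough sketch of how the length distribution can be computed from a metadata file. It assumes LJSpeech/m-ailabs-style rows of the form "id|transcript" — adjust the column index if your format differs:

```python
from collections import Counter

def length_histogram(metadata_lines, bin_size=10):
    """Bucket transcript character lengths into bins of `bin_size` characters."""
    bins = Counter()
    for line in metadata_lines:
        # Take the last pipe-separated column as the transcript.
        text = line.rstrip("\n").split("|")[-1]
        bins[(len(text) // bin_size) * bin_size] += 1
    return dict(sorted(bins.items()))

rows = ["0001|Bonjour tout le monde.", "0002|Oui."]
print(length_histogram(rows))  # {0: 1, 20: 1}
```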

Do you think that training for a little while on just the sentences shorter than 90 characters could fix that?
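If I try that, I’d probably just filter the metadata file before training. A minimal sketch (again assuming "id|transcript" rows, which is a guess about the exact format):

```python
def filter_short(metadata_lines, max_chars=90):
    """Keep only the rows whose transcript has fewer than max_chars characters."""
    return [line for line in metadata_lines
            if len(line.rstrip("\n").split("|")[-1]) < max_chars]

rows = [
    "0001|Une phrase courte.",
    "0002|" + "Une très longue phrase répétée encore et encore. " * 3,
]
print(filter_short(rows))  # only the first row survives
```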

Thanks. So it doesn’t look like it’s due to the sentence length, as 150 is quite reasonable and the distribution in that chart up to 150 characters looks healthy.

I suppose you could try a shorter length, like 90 as you say, and see if that helps. I might be tempted to try enabling forward attention too.
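If memory serves, forward attention is toggled with flags along these lines in config.json (do double-check the key names against the version of the repo you’re on):

```json
{
    "use_forward_attn": true,
    "forward_attn_mask": false,
    "transition_agent": false
}
```

Forward attention tends to force more monotonic alignments, which is exactly the property that discourages the attention from jumping back and re-reading earlier input.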

One last thing: do you have a sense of whether the particular speaker you’ve selected has consistent delivery? Generally, having a broadly similar speaking style over the whole dataset helps the model learn. If the speaker is too expressive or emotional, or their pitch goes up and down a lot, or the speed of delivery varies from quick to slow, these will all add to the challenges for the model. (Sorry, I know that’s kind of obvious and you’ve probably thought of it already!! :slightly_smiling_face: )

It might also be that your dataset has some untrimmed silences at the beginning or the end of the clips.
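A quick way to sanity-check for untrimmed edge silence: in practice you’d use something like librosa.effects.trim (which trims with a dB threshold), but the idea is just to locate the first and last samples above a small amplitude threshold. A toy sketch:

```python
def trim_silence(samples, thresh=0.01):
    """Return the slice of `samples` between the first and last loud samples.

    Toy, threshold-based trimmer: anything with absolute amplitude at or
    below `thresh` at the edges is considered silence.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > thresh]
    if not loud:
        return []  # the whole clip is silence
    return samples[loud[0]:loud[-1] + 1]

wav = [0.0, 0.0, 0.005, 0.3, -0.2, 0.4, 0.001, 0.0]
print(trim_silence(wav))  # [0.3, -0.2, 0.4]
```

If the trimmed clips come out noticeably shorter than the originals, that padding silence could well be confusing the stop-token prediction.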