Does anyone know if silences between words and at sentence boundaries (where sentence 1 ends and sentence 2 starts within the same utterance) affect alignment and attention performance? I am not talking about leading or trailing silences. I have been training Taco2 for a while on a very clean dataset with no background noise and fully correct transcriptions, but whenever I switch to r=2 the performance becomes very erratic. The only cause I can think of is the silences within the utterances, because the set has no silence at the start or end of utterances and all transcriptions are correct. The pitch is also good and the speaker takes care to enunciate everything.
If silences are inconsistent and long, they degrade performance. This is also the case with LJSpeech.
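If it helps, here is a rough sketch of one way to cap the internal pauses with librosa's silence splitting before training. The `top_db` threshold, the 0.3 s cap, and the helper name are just placeholders I picked, not something from the repo, so they would need tuning per dataset:

```python
# Minimal sketch: cap every internal silent gap to a fixed maximum length.
# Assumes 22.05 kHz mono wavs; top_db and max_pause_sec are guesses.
import librosa
import numpy as np
import soundfile as sf

def cap_internal_silences(wav_path, out_path, top_db=40, max_pause_sec=0.3, sr=22050):
    wav, sr = librosa.load(wav_path, sr=sr)
    max_pause = int(max_pause_sec * sr)
    # Non-silent (start, end) intervals in samples.
    intervals = librosa.effects.split(wav, top_db=top_db)
    pieces, prev_end = [], 0
    for start, end in intervals:
        gap = start - prev_end
        # Keep at most max_pause samples of each silent gap.
        if gap > max_pause:
            pieces.append(wav[prev_end:prev_end + max_pause])
        else:
            pieces.append(wav[prev_end:start])
        pieces.append(wav[start:end])
        prev_end = end
    pieces.append(wav[prev_end:])  # whatever remains after the last voiced chunk
    sf.write(out_path, np.concatenate(pieces), sr)
```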
Thanks Eren! That is sad; mine are inconsistent and long. I never shortened them because I thought they would make the model more versatile, but as soon as training moves to r=2 it becomes very fickle. What do you do for LJSpeech? I ask because the ForwardAttn+BN model's stopnet converges right away, while mine cannot get below 0.03 even though there are no leading or trailing silences.
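I will probably start by measuring how long and inconsistent the pauses actually are, with something along these lines (the dataset path and `top_db` threshold are placeholders for my setup):

```python
# Rough sketch: collect internal pause durations across the dataset
# to see how inconsistent they are before deciding how much to trim.
import glob
import librosa
import numpy as np

pause_lengths = []
for path in glob.glob("wavs/*.wav"):  # hypothetical dataset layout
    wav, sr = librosa.load(path, sr=22050)
    intervals = librosa.effects.split(wav, top_db=40)
    # Gaps between consecutive voiced intervals are the internal pauses.
    for (s1, e1), (s2, e2) in zip(intervals[:-1], intervals[1:]):
        pause_lengths.append((s2 - e1) / sr)

if pause_lengths:
    print("mean pause: %.3f s, std: %.3f s, max: %.3f s"
          % (np.mean(pause_lengths), np.std(pause_lengths), np.max(pause_lengths)))
```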