Can eSpeak transcription cause training/testing difficulties?

Hello all.
I’m training a model for Russian, and I just can’t get it good enough even for simple phrases.
Actually, this is my second attempt. The first dataset had around 24 hours of audiobook speech, but the reader mimics different voices from time to time, so I replaced it with 6 hours of clean recordings (the set is still growing each day; I know 6 hours is not a sufficient amount yet). Now it is perfect: a single speaker, monotonic speech, clip durations from 1 to 11 seconds, no long silences or other obvious reasons for bad learning.
My problem is this: on the first attempt the model could barely say a few of the test phrases by step 200k (batch size 32). Now I have trained the model for 190,000 steps (batch size 8), and it still cannot say anything. The test audios sound like a different language to me; I cannot understand the sentences. The model is not overfitted.
My concern is the eSpeak transcription. Could it be the reason for such bad model behavior? Just take a look at the transcription it produces during training:
sentence = “я тебя не понимаю”
output in the Jupyter notebook while testing:
^(en)aɪə tɛbiə niː pɒnɪmeɪuː(ru)~

Why is there an (en) tag? Shouldn’t it use only the (ru) one?
And yes, I sometimes see the message “full dictionary is not installed for Ru” (eSpeak has no full dictionary for the Russian language).
I think that if eSpeak is used for both training and testing, it produces the same transcription in both cases, so it seems it shouldn’t matter whether the transcription is correct as long as it is consistent. Still, I worry about it a lot, and I don’t know what else could cause such bad training.
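One thing I can try is calling the phonemizer directly, outside the training code, to see whether the language-switch marks come from eSpeak itself or from my pipeline. A rough sketch (assuming the phonemizer package with the espeak backend is installed; exact argument names may differ between versions):

```python
# Rough sketch: inspect the phoneme string eSpeak produces for a Russian
# sentence, independently of the training pipeline. Assumes the `phonemizer`
# package and an eSpeak backend that has the Russian voice installed.
from phonemizer import phonemize

text = "я тебя не понимаю"

# 'ru' is the eSpeak voice code for Russian; the backend is set explicitly
# because the default differs between phonemizer versions.
phonemes = phonemize(text, language="ru", backend="espeak", strip=True)
print(phonemes)

# If the output contains language-switch marks like "(en)...(ru)" here too,
# the problem is in eSpeak or in the input text, not in the training code.
```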
I tried LJSpeech, and that model became intelligible after around 30,000 steps…
I need your advice, guys. The parameters seem fine for the dataset; is there anything else I can try?
Thank you a lot for any suggestion.

Try not to use phonemes and see if it makes any difference.
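In config.json that is just one flag; roughly like this (key names as in the configs I have seen, so double-check against your TTS version):

```python
# Rough sketch: switch the model to plain character input by turning
# phonemes off in config.json. Key names may differ between TTS versions.
import json

with open("config.json") as f:
    config = json.load(f)

config["use_phonemes"] = False  # train on raw characters instead of eSpeak phonemes

with open("config.json", "w") as f:
    json.dump(config, f, indent=4, ensure_ascii=False)
```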

Since the Russian language is almost perfectly phonetic, there should be little need for phonetic transcriptions. The only difficult point is word stress.


Thank you for the advice!
I actually figured out why eSpeak produces such a (in my opinion) weird transcription: I hadn’t adapted symbols.py and cleaners.py, so my text was being forced to ASCII. I have now written my own cleaning method that throws out all Latin characters (Russian uses only Cyrillic), lowercases the text, and so on. Now eSpeak produces a much more reasonable transcription.
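A simplified sketch of the idea (not the exact function I use, just the gist of what the cleaner does):

```python
# Simplified sketch of a Russian text cleaner: lowercase the text, drop Latin
# letters (Russian transcripts should contain only Cyrillic), and collapse the
# whitespace left behind. The function still has to be hooked into cleaners.py.
import re

def russian_cleaner(text):
    text = text.lower()
    # drop Latin letters entirely
    text = re.sub(r"[a-z]", "", text)
    # collapse multiple spaces left after the removal
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(russian_cleaner("я Тебя не понимаю ABC"))  # -> "я тебя не понимаю"
```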
The only concern left is the attention alignment… Below is the plot from iteration 11k (I just started training from scratch on a ~10.5-hour dataset; I decided not to trim silence since it is already really short; batch size is 8 for training and 4 for evaluation).


I drew a red dotted line where I think the alignment is supposed to go, so right now it is clearly wrong. Is it likely to learn the attention properly with more training, or what else could be the reason? I saw some information about the stop_token parameter that I could tweak, but experimenting with it would take a lot of time…
Should I leave it as it is and wait for the magic to happen, or should I stop here and decrease (or increase?) stop_token?
Thank you so much!
Have a good one!

@vcjobacc As far as I can see, you use use_forward_attn=True. If that is the case, let it train longer. If nothing changes, then try use_forward_attn=False with location_attn=True. One of these should work. If not, let us know and we can discuss further.
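For reference, these are plain boolean flags in config.json; roughly like this (exact key names may vary by TTS version):

```python
# The relevant attention flags, shown as a Python dict for clarity.
# In practice they live in config.json; verify the names in your version.
attention_config = {
    "use_forward_attn": False,  # second thing to try if forward attention never aligns
    "location_attn": True,      # fall back to location-sensitive attention
}
```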


Thank you. I tried waiting longer with use_forward_attn = true, and then with location_attn = true instead. Unfortunately, neither worked.
I have now extended the dataset (it is ~16 hours) and cleaned it further, so I am pretty sure it is about 97% clean, with no serious errors.
The thing is, besides the attention problem, it cannot produce a single word properly, not even anything close…
What I suspect is that most of the sentences in the dataset are cut off in the middle: judging by the intonation the sentence should continue, but the clip ends there (that is how I tried to ensure no files are longer than 10 seconds).
If that is a serious mistake, what file duration is acceptable? Should I only use sentences with complete, correct intonation?
Thank you a lot!
p.s. I attached some plots just in case they are needed.
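In case anyone wants to check their own clips, this is roughly the kind of script I mean for finding files that are too long (a rough sketch assuming an LJSpeech-style layout with wavs/&lt;id&gt;.wav and a pipe-separated metadata.csv; adjust to your own layout):

```python
# Rough sketch: list clips longer than a threshold so they can be re-cut on
# sentence boundaries instead of in the middle. Assumes LJSpeech-style layout:
# wavs/<id>.wav and a pipe-separated metadata.csv.
import csv
import wave

MAX_SECONDS = 10.0

with open("metadata.csv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        clip_id = row[0]
        with wave.open(f"wavs/{clip_id}.wav") as w:
            duration = w.getnframes() / w.getframerate()
        if duration > MAX_SECONDS:
            print(f"{clip_id}: {duration:.1f} s")
```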

It looks like splitting sentences in the middle was indeed my serious mistake. I am now training on another dataset where almost all sentences end at the right spot, and by 20,000 steps it has already started to align fine! The audio it produces sounds promising too. Thank you for all the advice; I hope it will work now!
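For anyone preparing their own texts: a hypothetical helper along these lines (not the exact tool I used) makes sure each line ends on sentence-final punctuation, so the intonation in the matching recording also ends properly:

```python
# Hypothetical helper: split a raw transcript on sentence-final punctuation
# so each line/clip ends where the intonation actually falls, instead of
# cutting sentences in the middle.
import re

def split_into_sentences(text):
    # split after . ! ? or … followed by whitespace; keep the punctuation
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]

print(split_into_sentences("Я тебя не понимаю. Повтори, пожалуйста! Хорошо?"))
# -> ['Я тебя не понимаю.', 'Повтори, пожалуйста!', 'Хорошо?']
```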


Have you succeeded? I am going to train a model for Russian as well, using the Common Voice dataset. Any tips for config.json?