I’m training a model for Russian, and I just can’t make it good enough even for simple phrases.
Actually, this is my second attempt. The first dataset had around 24 hours of audiobook speech, but the reader tried to mimic different voices from time to time, so I have now replaced it with 6 hours (for now — it is growing each day; I know 6 hours is not a sufficient amount). The new data is clean: a single speaker with monotonic speech, utterance durations from 1 to 11 seconds, no long silences or other obvious reasons for bad learning.
My question: on the first attempt, the model could barely say a few of the phrases by step 200k (batch size 32). Now I have trained for 190,000 steps (batch size 8), but it still can say nothing. The TestAudios sound like a different language to me; I cannot understand the sentences. The model is not overfitted.
My concern is the eSpeak transcription. Could it be the reason for such bad model behavior? Just take a look at the transcription it produces during training:
sentence = “я тебя не понимаю” (“I don’t understand you”)
output in the Jupyter notebook while testing:
^(en)aɪə tɛbiə niː pɒnɪmeɪuː(ru)~
Why is there an (en) tag? Shouldn’t it use only the (ru) voice?
And yes, I sometimes see the message “full dictionary is not installed for Ru” (eSpeak has no full dictionary for Russian).
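Before training further, it may be worth scanning the phonemized transcripts for these language-switch tags, since each `(en)…(ru)` span means eSpeak fell back to a different voice for part of the sentence. Here is a minimal sketch of such a check — the regex and the helper name are my own, not part of any TTS toolkit:

```python
import re

# eSpeak marks language switches in its phoneme output with tags like
# "(en)" or "(ru)". Any tag other than the expected language means part
# of the sentence was phonemized with a fallback voice.
LANG_TAG = re.compile(r"\(([a-z]{2,3}(?:-[a-z]+)?)\)")

def language_switches(phonemes: str, expected: str = "ru") -> list:
    """Return the unexpected language tags found in a phonemized string."""
    return [tag for tag in LANG_TAG.findall(phonemes) if tag != expected]

# The output string from my notebook, checked for stray switches:
out = "^(en)aɪə tɛbiə niː pɒnɪmeɪuː(ru)~"
print(language_switches(out))  # → ['en']
```

Running this over every transcript in the dataset would show how many utterances are affected; if most of them carry an `(en)` prefix like the example above, the model is effectively being trained on English phoneme sequences for Russian audio.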
I think that if eSpeak is used for both training and testing, it produces the same transcription in both cases, so it would seem not to matter whether the transcription is correct as long as it is stable. Still, I worry a lot about it, and I don’t know of any other reason that could cause such bad training.
I tried LJSpeech, and that model became understandable after around 30,000 steps…
I need your advice, guys. The parameters seem good for the dataset — is there anything else I can try?
Thanks a lot for any suggestions.