I have tried training on Common Voice English, as you advised me, @lissyx, and I used the parameters from .compute, except for a larger batch size of 65 to speed the experiment up. It ended up like my original dataset: early overfitting and a poor test WER of 0.588034. Here is the loss evolution:
There must be something wrong, and I still have no clue what. I believe I followed the instructions quite precisely, and it also happens on different hardware. It worked maybe a year ago, when I had a master checkout (I don’t know exactly which commit). Then I updated to v0.6.1, and it has been happening since. It may not have anything to do with the update, I don’t know; I tried re-cloning to no avail.
Please, how can I go about identifying the cause of this behavior? Or what solution would you suggest trying next? Thank you sincerely.
lissyx
Don’t, those are specific to our cluster
lissyx
There has been a lot of noise on that thread; could you please recap exactly your status and the problem we are trying to address here?
I have a large Czech dataset, but training overfits after just a few epochs and the test WER is about 0.9. To rule out a problem in the data or the LM, I have trained on Common Voice English with the distributed language model, and I get similar results: overfitting after a few epochs and a very large test WER.
lissyx
Which is … expected? The documented English hyper-parameters are for ~3500-4000 hours of English, with other datasets than Common Voice. Achieving 58% WER on Common Voice alone with those might be quite good … and to me it would just confirm your setup is more or less fine for Common Voice English.
I see. I am still baffled, because I had trained a model on about 100 hours of single-speaker data and got to 0.2 WER, but when I tried again recently I got to 0.9 WER. I will retry; maybe I made a mistake. And maybe there’s something wrong with my ~1,000-hour Czech dataset, but I fail to see the problem, and it looks to me like training performs much worse than it did.
lissyx
I shall try re-training on the small single-speaker dataset and check whether my large dataset has any substantial errors in it. Thank you for your kind help.
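For the dataset check, I am thinking of something along these lines; it is just a rough sketch, and the CSV and alphabet paths are placeholders for my own layout:

```bash
# Rough sanity checks on a DeepSpeech-style CSV (wav_filename,wav_filesize,transcript).
# Placeholder paths; run in a UTF-8 locale so multibyte Czech characters are handled.
CSV=czech/train.csv
ALPHABET=data/alphabet.txt

# 1) Every referenced audio file exists.
tail -n +2 "$CSV" | cut -d, -f1 | while read -r wav; do
  [ -f "$wav" ] || echo "missing audio: $wav"
done

# 2) No empty transcripts.
awk -F, 'NR > 1 && $3 == "" { print "empty transcript on CSV line " NR }' "$CSV"

# 3) Characters that appear in transcripts but are missing from alphabet.txt
#    (any output here means broken labels).
tail -n +2 "$CSV" | cut -d, -f3- | grep -o . | sort -u > /tmp/used_chars
grep -v '^#' "$ALPHABET" | sort -u > /tmp/alphabet_chars
comm -23 /tmp/used_chars /tmp/alphabet_chars
```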
lissyx
I’m really not sure this is a good idea; you might end up with different hyper-parameters because of that.
I would rather start from a subset of this, with default values and LM alpha/beta set to 0.0; getting the learning rate and dropout right should be enough, and then you can work on the LM parameters.
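Something along these lines, as a rough untested sketch, assuming a v0.6.1 checkout; all paths, the batch size and the learning-rate/dropout starting points are placeholders for you to iterate on:

```bash
# Train on a data subset with the LM weights zeroed out, so the dev/test WER
# reflects the acoustic model only. Placeholder paths and values.
python3 DeepSpeech.py \
  --train_files subset/train.csv \
  --dev_files subset/dev.csv \
  --test_files subset/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --lm_alpha 0.0 \
  --lm_beta 0.0 \
  --learning_rate 0.0001 \
  --dropout_rate 0.25 \
  --train_batch_size 32 \
  --epochs 30 \
  --checkpoint_dir checkpoints/subset_run
```

Once the loss curves look sane with that, you can bring back non-zero lm_alpha / lm_beta and tune them separately.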
What do you mean? I’m not implying I would train on the small dataset and then continue with the large one. I mean I’d try to get the same error rate I did before.
lissyx
Okay, I misunderstood and lost track then. There is no good reason you cannot reproduce your results, but be aware that a lot might have changed and you might have to iterate before reproducing them. Make sure you disable automatic mixed precision when you try to exactly replicate something; I have found and documented that it can add small numerical instability, which adds a little variance to the end results.
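Concretely that is a single flag on the training command, assuming your checkout exposes it (boolean flags take the = form or a --no prefix):

```bash
# Keep automatic mixed precision off when replicating an old result, to avoid
# the small numerical variance mentioned above. Placeholder paths.
python3 DeepSpeech.py \
  --automatic_mixed_precision=false \
  --train_files single_speaker/train.csv \
  --dev_files single_speaker/dev.csv \
  --test_files single_speaker/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --checkpoint_dir checkpoints/replication_run
```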
It was the language model after all. Without it, the performance is solid; with it, it is catastrophic. That is also what you suggested at the very beginning, @lissyx. Now I just have to find where I went wrong while building it.
lissyx
Right, we are progressing, thanks!
Just make sure you are following the exact steps documented in your version’s data/lm, and that you do use the proper alphabet (same ordering, same content).
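From memory, the whole data/lm flow boils down to three tools; treat the following as a sketch and check the exact flags against the data/lm README of your version, the corpus path and pruning values here are just placeholders:

```bash
# 1) ARPA model with KenLM from a normalized text corpus
#    (one sentence per line, same character set as alphabet.txt).
lmplz --order 5 --prune 0 0 1 --text corpus.txt --arpa lm.arpa

# 2) Quantized binary, like the shipped English LM.
build_binary -a 255 -q 8 trie lm.arpa lm.binary

# 3) Decoder trie, built with the SAME alphabet.txt you train with
#    (same content, same ordering).
./generate_trie data/alphabet.txt lm.binary trie
```

A mismatch in step 3, e.g. a trie built from a differently ordered alphabet than the one used for training, is exactly the kind of thing that can give you “fine without the LM, catastrophic with it”.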
I have to admit I really don’t understand where the pain points are in building the LM; it’s really not complicated, but people seem to struggle a lot.
Feedback / doc improvements would be welcome, if you have some.