Early overfit

Make sure you are not re-starting from an older checkpoint.
Make sure you use the proper CUDNN and CUDA versions.

The --use_cudnn_rnn flag should not change how the network converges, but it will use CUDNN-optimized TensorFlow LSTM cells that run faster.

The early-stop default parameters are not super reliable. You need to do your own analysis and tuning of them.
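
For illustration, here is a minimal, generic patience-based check on dev loss, the kind of criterion you would analyse and tune yourself. This is not DeepSpeech’s built-in early stop (see --helpfull for the actual flags in your version), and the patience and min_delta defaults below are made-up values:

# Sketch of a generic patience-based early-stop rule on dev loss.
# Not DeepSpeech's built-in criterion; patience and min_delta are
# illustrative values you would tune against your own curves.
def should_stop(dev_losses, patience=4, min_delta=0.5):
    """Stop when the best dev loss of the last `patience` epochs is not
    at least `min_delta` better than the best seen before them."""
    if len(dev_losses) <= patience:
        return False
    best_before = min(dev_losses[:-patience])
    recent_best = min(dev_losses[-patience:])
    return recent_best > best_before - min_delta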

I have tried training Common Voice English, as you advised me, @lissyx, and I used the same parameters as in .compute, except for a larger batch size of 65 to speed the experiment up. It ended up like my original dataset: early overfit, with a high test WER of 0.588034. Here is the loss evolution:

train         dev
107.415366    90.988997
 77.633969    79.132081
 66.763453    73.079587
 59.810548    68.768595
 54.691266    65.851911
 50.673932    64.093251
 47.387396    62.118382
 44.590231    62.041358
 42.132208    61.115278
 39.956110    60.476461
 37.953228    60.269572
 36.213714    59.242173
 34.556712    59.173234 *
 33.081850    59.734880
 31.704943    59.559893
 29.725658    60.446697
 28.603038    59.974832
 27.602044    60.964456
 26.651280    61.209080
 25.728388    62.034175
 24.916874    61.714609
 24.130870    62.894777
 23.336425    62.838955
 22.688621    63.969326
 22.020799    64.639862
 21.306159    64.927457
 20.746867    65.340298
 20.205466    65.611585
 19.624681    67.308286
 19.077871    67.558523
 18.554258    69.291947
 18.068047    69.521966
 17.536386    71.407673
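
To put a number on where that run turns over: the dev loss bottoms out at the row marked with the asterisk while the train loss keeps falling, which is the classic overfitting signature. A quick check (the values are simply copied from the dev column above):

# Dev losses copied from the table above; find where they bottom out.
dev_losses = [
    90.988997, 79.132081, 73.079587, 68.768595, 65.851911, 64.093251,
    62.118382, 62.041358, 61.115278, 60.476461, 60.269572, 59.242173,
    59.173234, 59.734880, 59.559893, 60.446697, 59.974832, 60.964456,
    61.209080, 62.034175, 61.714609, 62.894777, 62.838955, 63.969326,
    64.639862, 64.927457, 65.340298, 65.611585, 67.308286, 67.558523,
    69.291947, 69.521966, 71.407673,
]

best_epoch = min(range(len(dev_losses)), key=dev_losses.__getitem__) + 1
print(best_epoch, dev_losses[best_epoch - 1])  # -> 13 59.173234

Everything after that point is the model fitting the training set rather than generalising, so the checkpoint from around epoch 13 is the one worth keeping.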

I used these parameters:

DeepSpeech.py \
    --alphabet_config_path "data/alphabet.txt" \
    --checkpoint_dir "checkpoints" \
    --dev_batch_size 65 \
    --dev_files "dev.csv" \
    --export_dir "model" \
    --lm_binary_path "data/lm/lm.binary" \
    --lm_trie_path "data/lm/trie" \
    --summary_dir "summaries" \
    --test_batch_size 65 \
    --test_files "test.csv" \
    --train_batch_size 65 \
    --learning_rate 0.0001 \
    --dropout_rate 0.2 \
    --n_hidden 2048 \
    --use_cudnn_rnn \
    --noearly_stop \
    --train_files "train.csv"

There must be something wrong, and I still have no clue. I believe I followed the instructions quite precisely. It also happens on different hardware. It worked maybe a year ago, when I had a master checkout; I don’t know exactly which commit. Then I updated to v0.6.1, and since then this happens. It may not have anything to do with the update, I don’t know. I tried re-cloning, to no avail.

Please, how can I go about identifying the cause of this behavior? Or what solution would you suggest trying next? Thank you sincerely.

Don’t use the .compute parameters, those are specific to our cluster.

There has been a lot of noise in this thread; could you please recap exactly your status and the problem we are trying to address here?

I have a large Czech dataset, but I am getting overfitting after just a few epochs, and the test WER is about 0.9. To rule out a problem in the data or the LM, I trained on Common Voice English with the distributed language model. I get similar results: overfitting after a few epochs and a very large test WER.

Which is … expected? The documented English hyper-parameters are for ~3500-4000 hours of English, with other datasets than Common Voice. Achieving 58% WER on Common Voice alone with those might be quite good … And to me it would just confirm your setup is more-or-less fine for Common Voice English.

I see. I am still baffled, because I had trained a model on single-speaker data of about 100 hours and got to 0.2 WER. I tried again recently and got 0.9 WER. I will retry; maybe I made a mistake. And maybe there’s something wrong with my 1000-hour Czech dataset, but I fail to see the problem, and it looks to me like the training performs much worse than it did.

That sounds like overfitting …

No no, it works fine. I have results on about 1000 hours; I have no precise numbers, but the result is fairly usable.

I can’t find any reference to the older version of DeepSpeech you were working with, but maybe you have to update your hyper-parameters, etc.?

Is there a guide for tuning the hyper-parameters?

No, it depends on your dataset, there’s no “documentation” that can be made …

I mean which ones exist, what values they can take, and what they actually stand for.

This is all documented in the code, in the releases and in --helpfull.

I shall try re-training on the small single-speaker dataset and check that my large dataset has no substantial errors in it. Thank you for your kind help.

I’m really not sure this is a good idea; you might need different hyper-parameters because of that.

I would rather start from a subset of this, with default values and LM alpha/beta set to 0.0; getting the learning rate and dropout right should be enough, and then you can work on the LM parameters.
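
To make that concrete, here is a sketch of the kind of small sweep this implies: a subset of the data, the LM neutralised with alpha/beta at 0.0, and only the learning rate and dropout varying. The paths, subset CSVs and grid values are placeholders for your own setup, not recommendations:

import itertools
import subprocess

# Small grid over the two knobs that matter first; values are placeholders.
learning_rates = [0.0001, 0.00005]
dropouts = [0.2, 0.3, 0.4]

for lr, dropout in itertools.product(learning_rates, dropouts):
    run_name = f"lr{lr}_do{dropout}"
    subprocess.run([
        "python", "DeepSpeech.py",
        "--alphabet_config_path", "data/alphabet.txt",
        "--train_files", "train_subset.csv",  # a subset, not the full dataset
        "--dev_files", "dev_subset.csv",
        "--test_files", "test_subset.csv",
        "--lm_binary_path", "data/lm/lm.binary",
        "--lm_trie_path", "data/lm/trie",
        "--lm_alpha", "0.0",                  # neutralise the LM for now
        "--lm_beta", "0.0",
        "--learning_rate", str(lr),
        "--dropout_rate", str(dropout),
        "--n_hidden", "2048",
        "--checkpoint_dir", f"checkpoints/{run_name}",
        "--summary_dir", f"summaries/{run_name}",
    ], check=True)

Compare the dev-loss curves of those runs first; only once learning rate and dropout look reasonable is it worth bringing alpha/beta back and tuning them.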

What do you mean? I’m not implying I would train on the small dataset and then continue with the large one. I mean I’d try to get the same error rate I did before.

Okay, I misunderstood and lost track then. There is no good reason you cannot reproduce your results, but be aware that a lot might have changed and you might have to iterate before reproducing them. Make sure you disable automatic mixed precision when you try to exactly replicate something; I have found and documented that it can add small numerical instability, which adds a little variance to the end results.

It was the language model after all. Without it, the performance is solid; with it, it is catastrophic. That was also what you suggested at the very beginning, @lissyx. Now I just have to find where I went wrong in building it.

Right, we are progressing, thanks!

Just make sure you are following the exact steps documented in your version’s data/lm, and that you do use the proper alphabet (same ordering, same content).
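
One cheap sanity check on the alphabet point: verify that the alphabet you used when building the LM/trie is byte-for-byte the one used for training, and that every character in your transcripts is covered by it. A sketch, with file names that are only assumptions about your layout:

import csv

def load_alphabet(path):
    # One character per line; lines starting with '#' are comments.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if not line.startswith("#")]

train_alphabet = load_alphabet("data/alphabet.txt")
trie_alphabet = load_alphabet("lm_build/alphabet.txt")  # the copy used for the trie build

# Same content and same ordering, as required.
assert train_alphabet == trie_alphabet, "alphabets differ in content or order"

# Every transcript character must be in the alphabet.
allowed = set("".join(train_alphabet))
with open("train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        extra = set(row["transcript"]) - allowed
        if extra:
            print(row["wav_filename"], "has characters outside the alphabet:", sorted(extra))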

I have to admit I really don’t understand where the pain points are in building the LM; it’s really not complicated, but people seem to struggle a lot.

Feedback / doc improvements would be welcome, if you have some.