Early overfit

That will force you to have a small batch size.

You might want to have a look at the training and validation loss evolution and disable / tune the early-stop parameters. The default ones are not that good for general use.

What do you mean here?

Either you have an issue in your data, or it’s your training that is wrong. With 1000 hours you should have much, much better results.

What’s your language model?

The loss evolution is as follows:

train       dev
----------  ----------
196.224552  168.403517
155.266635  144.368254
133.005606  129.608997
118.243783  120.167001
107.356690  113.943017
 98.876653  109.781181
 91.966301  108.019155
 86.008403  105.392287
 80.699095  103.264612
 76.000882  103.242084
 71.748770  104.472738
 67.746588  105.081310

I encode the Czech diacritic characters as follows:

á 'a, é 'e, ě 'j, í 'i, ó 'o, ú 'u, ů 'w, ý 'y
ž 'z, š 's, č 'c, ř 'r, ď 'd, ť 't, ň 'n
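
This is not my exact script, but the conversion amounts to a character-for-character substitution; a minimal sketch of it (with made-up file names) would be:

    # ASCII-ify Czech diacritics using the mapping above; input/output names are placeholders
    sed -e "s/á/'a/g" -e "s/é/'e/g" -e "s/ě/'j/g" -e "s/í/'i/g" \
        -e "s/ó/'o/g" -e "s/ú/'u/g" -e "s/ů/'w/g" -e "s/ý/'y/g" \
        -e "s/ž/'z/g" -e "s/š/'s/g" -e "s/č/'c/g" -e "s/ř/'r/g" \
        -e "s/ď/'d/g" -e "s/ť/'t/g" -e "s/ň/'n/g" \
        < transcripts-utf8.txt > transcripts-ascii.txt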

I have trained the language model on a superset of the transcripts, something like 2.5× the amount of text that’s in the training data. It was trained with the current master branch of KenLM, the binary was built with the same, and the trie was built with DeepSpeech’s native_client generate_trie binary.

I am not sure I understand what you are doing. Why not just use plain UTF-8?

Do you mind sharing the exact steps? Have you verified you are using the exact same alphabet file?

I was not sure how stable the UTF-8 support is, so this encoding was the first thing I tried. It worked for me before, when I was training a model in early 2019.

I build the language model like this:

  1. build current KenLM
  2. get and unzip this file: http://commondatastorage.googleapis.com/sixtease/junk/parliament-corpus-ascii.txt.gz
  3. lmplz -o 3 < parliament-corpus-ascii.txt > lm.arpa
  4. build_binary lm.arpa lm.binary
  5. bazel-bin/native_client/generate_trie alphabet.txt lm.binary trie

I’m not sure what you are referring to. I’m not talking about the --utf8 mode that works without an alphabet, just about using UTF-8-encoded data in your dataset and alphabet.

You might want to properly re-do as in data/generate_lm.py.
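
Roughly, that script builds a pruned, quantized 5-gram model, something along these lines (options quoted from memory, so check the script itself for the exact invocation and file names):

    # 5-gram ARPA with pruning, then a quantized trie binary, then the trie for the decoder
    lmplz --order 5 --text vocab.txt --arpa lm.arpa --prune 0 0 1
    build_binary -a 255 -q 8 trie lm.arpa lm.binary
    bazel-bin/native_client/generate_trie alphabet.txt lm.binary trie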

@lissyx Thank you very much for your help so far. I have tried generating a language model copying the steps in data/generate_lm.py. The result was basically the same: early stop after 11 epochs. I have also tried training with Common Voice and the LM shipped in the DeepSpeech package, but I keep getting a segfault similar to this one: Segmentation fault, which also seems to be LM-related if I understand it correctly. I don’t know what to try next.

[UPDATE]: Aha, the lm.binary and trie files have just 133 bytes each. Something’s wrong, I’ll dig into it.

[UPDATE 2]: No, I keep getting the segmentation fault early in the 1st epoch of training, even with the correct downloaded LM. :frowning:

Please make sure you followed the setup accurately, especially that you have matching versions between deepspeech and ds_ctcdecoder.
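
A quick way to check, assuming you installed the decoder package with pip and train from a git checkout of DeepSpeech:

    # compare the installed decoder package version against the training code checkout
    pip3 list | grep -iE 'ctcdecoder|deepspeech'
    git describe --tags    # run inside the DeepSpeech checkout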

We can’t provide help without more context on what you do precisely.


I have managed to train with Common Voice. The problem occurs even there, early stop after 9 epochs.

Loss evolution:

dev    train
-----  -----
72.04  65.58
67.19  57.35
65.88  52.53
63.44  49.45
62.53  47.12
62.56  45.49
62.01  44.04
61.13  42.57
60.86  41.75
61.29  41.19

I had an error in the test data, so I’ll have to wait for test WER a bit.
I am also a bit baffled as to why the early stop occurred after just a single increase in dev loss.
[UPDATE]: Test WER: 0.606421

For German, with an input of 1000 hours, you should get better results. So you might have bad or incongruent data, or somehow your alphabet doesn’t represent Czech?

Maybe there is a good Czech dataset of 100 or so hours that you could use for training?

As for the early stop, that’s OK for your data; I wouldn’t expect miracles with more epochs, but you can tune it via parameters if you want to.

My last post is about Common Voice English with the provided language model. Something must be wrong with my setup.

I haven’t run that, but I would guess you get down to about WER 0.15 in about 15 epochs without early stop.

I would suggest you start over with a clean setup; usually I then find something that I shouldn’t have done :slight_smile: What parameters did you call DeepSpeech with for that result?

./DeepSpeech.py \
    --alphabet_config_path "$ASRH/res/alphabet.txt" \
    --checkpoint_dir "$ASRH/temp/checkpoints" \
    --checkpoint_secs 900 \
    --dev_batch_size 80 \
    --dev_files "$ASRH/dev.csv" \
    --export_dir "$ASRH/model" \
    --lm_binary_path "$ASRH/data/lm/lm.binary" \
    --lm_trie_path "$ASRH/data/lm/trie" \
    --max_to_keep 3 \
    --summary_dir "$ASRH/temp/summaries" \
    --test_batch_size 80 \
    --test_files "$ASRH/test.csv" \
    --train_batch_size 100 \
    --train_files "$ASRH/train.csv"

You give no CuDNN, learning rate, or dropout flags. I would use a much higher dropout, like 0.25, and results should improve dramatically; the standard learning rate should be fine, and CuDNN will speed things up.
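
For example, keeping the paths from your command above and setting the learning rate and dropout explicitly (0.25 for dropout is a starting point, not a tuned value):

    ./DeepSpeech.py \
        --alphabet_config_path "$ASRH/res/alphabet.txt" \
        --checkpoint_dir "$ASRH/temp/checkpoints" \
        --train_files "$ASRH/train.csv" \
        --dev_files "$ASRH/dev.csv" \
        --test_files "$ASRH/test.csv" \
        --learning_rate 0.0001 \
        --dropout_rate 0.25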

What do you mean I give no CuDNN? I have configured CuDNN usage via environment variables. Do I have to specify some DeepSpeech options for proper CUDA use?

I think it is use_cudnn_rnn if I remember correctly, but that’s mainly for speed, not WER.

I was giving no such option and the GPUs were clearly used, judging 1) from the speed and 2) from the fact that the number of steps halved when I used two GPUs instead of one. Now that I have tried the --use_cudnn_rnn option, I get an error: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams'. Is this something I should dig into, or is it fine to just continue without the option?

The dropout should give you much better results. You might even try values like 0.4 in a later run, depending on the dataset.

@lissyx should know whether --use-cudnn_rnn would give you much better results if you are already on GPU support. I haven’t trained without it for some time.

Make sure you are not re-starting from an older checkpoint.
Make sure you use proper CUDNN and CUDA versions.
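
For example, to confirm what TensorFlow actually sees (using the TF 1.x API that v0.6.1 targets):

    # check the driver/GPU state and whether TensorFlow was built with CUDA and can see a GPU
    nvidia-smi
    python3 -c "import tensorflow as tf; print(tf.VERSION, tf.test.is_built_with_cuda(), tf.test.is_gpu_available())"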

The --use_cudnn_rnn flag should not change how the network converges, but it will use CuDNN-optimized TensorFlow LSTM cells, which process faster.

The early stop default parameters are not super-reliable. You need to do your own analysis and tuning of them.
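
For instance, you can disable it entirely with --noearly_stop, or list the related flags and their defaults before tuning (assuming the absl-style --helpfull output):

    # show the early-stop flags and their current defaults
    python3 DeepSpeech.py --helpfull 2>&1 | grep -iE 'early_stop|es_'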


I have tried training Common Voice English, as you advised me, @lissyx, and I used the parameters like in .compute, except that I used a greater batch size of 65 to speed the experiment up. It ended up like my original dataset: early overfit, with a test WER of 0.588034. Here is the loss evolution:

train         dev
107.415366    90.988997
 77.633969    79.132081
 66.763453    73.079587
 59.810548    68.768595
 54.691266    65.851911
 50.673932    64.093251
 47.387396    62.118382
 44.590231    62.041358
 42.132208    61.115278
 39.956110    60.476461
 37.953228    60.269572
 36.213714    59.242173
 34.556712    59.173234 *
 33.081850    59.734880
 31.704943    59.559893
 29.725658    60.446697
 28.603038    59.974832
 27.602044    60.964456
 26.651280    61.209080
 25.728388    62.034175
 24.916874    61.714609
 24.130870    62.894777
 23.336425    62.838955
 22.688621    63.969326
 22.020799    64.639862
 21.306159    64.927457
 20.746867    65.340298
 20.205466    65.611585
 19.624681    67.308286
 19.077871    67.558523
 18.554258    69.291947
 18.068047    69.521966
 17.536386    71.407673

I used these parameters:

DeepSpeech.py \
    --alphabet_config_path "data/alphabet.txt" \
    --checkpoint_dir "checkpoints" \
    --dev_batch_size 65 \
    --dev_files "dev.csv" \
    --export_dir "model" \
    --lm_binary_path "data/lm/lm.binary" \
    --lm_trie_path "data/lm/trie" \
    --summary_dir "summaries" \
    --test_batch_size 65 \
    --test_files "test.csv" \
    --train_batch_size 65 \
    --learning_rate 0.0001 \
    --dropout_rate 0.2 \
    --n_hidden 2048 \
    --use_cudnn_rnn \
    --noearly_stop \
    --train_files "train.csv"

There must be something wrong and I still have no clue. I believe I followed the instructions quite precisely, and it also happens on different hardware. It worked maybe a year ago, when I had a master checkout; I don’t know exactly which commit. Then I updated to v0.6.1 and since then this happens. It may not have anything to do with the update, I don’t know. I tried re-cloning, to no avail.

Please, how can I go about identifying the cause of this behavior? Or what solution would you suggest trying next? Thank you sincerely.

Don’t; those are specific to our cluster.