Early overfit

Hello, I trained a model on a custom dataset: ASCII-encoded Czech, a bit over 1000 hours. I got an early stop after 12 epochs with a test WER of 0.97. This happened with batch size 50 on 2 GPUs and learning rate 0.0001; with the default learning rate it stopped after 6 epochs. I use DeepSpeech v0.6.1, CUDA 10.0 and CuDNN 7.6. Any ideas what the matter might be?

2 ideas:

(1) Your data might not be good for training. Do the transcripts align with the wavs? Are they in the right format? Do you have too much noise?

(2) Your setup might have an issue. Take a Common Voice or LibriVox dataset of about 100 hours of English and train it for a few epochs. Do you get the same issues?

Thank you @othiele for the tips. The data seem to be all right. An occasional 0.1 s overlap with the previous word occurs, and numerals are written as digits; those are the only problems I could find. The format is RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz.

I shall try to train on Common Voice and get back to you.

[UPDATE]: Oh yes, and the sample lengths are 12-30 seconds, if that could be an issue.
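For reference, this is roughly how I spot-check the clips (just a sketch; it assumes sox and file are installed, and the wav/ directory is only an example path):

# Container/encoding check; every line should report 16 bit, mono, 16000 Hz PCM.
file wav/*.wav | head

# Longest clips in seconds (soxi ships with sox), to confirm the 12-30 s range.
for f in wav/*.wav; do
    printf '%s\t%s\n' "$(soxi -D "$f")" "$f"
done | sort -n | tail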

That will force you to have a small batch size.

You might want to have a look at the training and validation loss evolution and disable or tune the early-stop parameters. The default ones are not that good for general use.
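If I remember the 0.6 flag names correctly (please verify against util/flags.py in your checkout), that would look roughly like this:

# Disable early stopping entirely (absl boolean flag):
./DeepSpeech.py --noearly_stop [...your other flags...]

# Or keep it but make it less aggressive; es_steps is the number of validations
# considered, es_mean_th / es_std_th are the loss thresholds:
./DeepSpeech.py --es_steps 8 --es_mean_th 0.1 --es_std_th 0.1 [...your other flags...]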

What do you mean here?

Either you have an issue in your data, or something is wrong with your training. With 1000 hours you should get much, much better results.

What is your language model?

The loss evolution is as follows for training / dev:

train       dev
----------  ----------
196.224552  168.403517
155.266635  144.368254
133.005606  129.608997
118.243783  120.167001
107.356690  113.943017
 98.876653  109.781181
 91.966301  108.019155
 86.008403  105.392287
 80.699095  103.264612
 76.000882  103.242084
 71.748770  104.472738
 67.746588  105.081310

I encode the Czech diacritic characters as follows:

á 'a, é 'e, ě 'j, í 'i, ó 'o, ú 'u, ů 'w, ý 'y
ž 'z, š 's, č 'c, ř 'r, ď 'd, ť 't, ň 'n
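
The conversion itself is roughly this (a simplified sketch rather than my exact script; it assumes UTF-8 input, only handles lowercase, and the file names are just examples):

sed -e "s/á/'a/g" -e "s/é/'e/g" -e "s/ě/'j/g" -e "s/í/'i/g" \
    -e "s/ó/'o/g" -e "s/ú/'u/g" -e "s/ů/'w/g" -e "s/ý/'y/g" \
    -e "s/ž/'z/g" -e "s/š/'s/g" -e "s/č/'c/g" -e "s/ř/'r/g" \
    -e "s/ď/'d/g" -e "s/ť/'t/g" -e "s/ň/'n/g" \
    transcripts-utf8.txt > transcripts-ascii.txt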

I have trained the language model on a superset of the transcripts, roughly 2.5× the amount of text that is in the training data. It was trained with the current master branch of KenLM, the binary was built with the same, and the trie was built with DeepSpeech's native_client generate_trie binary.

I am not sure I understand what you are doing. Why not just use plain UTF-8?

Do you mind sharing the exact steps? Have you verified you are using the exact same alphabet file?

I was not sure how stable the UTF-8 support is, so ASCII encoding was the first thing I tried. It worked for me before, when I was training a model in early 2019.

I build the language model like this (with a sanity check after the steps):

  1. build current KenLM
  2. get and unzip this file: http://commondatastorage.googleapis.com/sixtease/junk/parliament-corpus-ascii.txt.gz
  3. lmplz -o 3 < parliament-corpus-ascii.txt > lm.arpa
  4. build_binary lm.arpa lm.binary
  5. bazel-bin/native_client/generate_trie alphabet.txt lm.binary trie
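
The sanity check: I score a few held-out transcript lines with KenLM's query tool; seeing sensible per-word log probabilities at least tells me the binary LM loads and matches the text encoding (the file names and paths are examples from my setup):

# Prints per-word log10 probabilities plus a total and OOV count per line.
head -n 5 dev-transcripts-ascii.txt | kenlm/build/bin/query lm.binary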

I’m not sure what you are referring to. I’m not talking about the --utf8 mode that works without an alphabet, just about using UTF-8-encoded data in your dataset and alphabet.

You might want to properly redo it as in data/generate_lm.py.

@lissyx Thank you very much for your help so far. I have tried generating a language model by copying the steps in data/generate_lm.py. The result was basically the same: an early stop after 11 epochs. I have also tried training with Common Voice and the LM shipped with the DeepSpeech package, but I keep getting a segfault similar to this one: Segmentation fault, which also seems to be LM-related if I understand it correctly. I don’t know what to try next.

[UPDATE]: Aha, the lm.binary and trie files have just 133 bytes each. Something’s wrong, I’ll dig into it.
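Files that small look suspiciously like Git LFS pointer files rather than the actual model, so that is the first thing I am checking (assuming that is indeed the cause):

# If they are LFS pointers, the output starts with "version https://git-lfs...":
head -c 200 data/lm/lm.binary data/lm/trie

# In that case, fetch the real files (requires git-lfs to be installed):
git lfs install && git lfs pull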

[UPDATE 2]: No, I keep getting the segmentation fault early in the 1st epoch of training, even with the correct downloaded LM. :frowning:

Please make sure you followed the setup accurately, especially that you have matching versions between deepspeech and ds_ctcdecoder.
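For example, something along these lines shows whether the versions line up (the checkout path is just an example):

# The ds_ctcdecoder package version should match the DeepSpeech checkout (v0.6.1 here):
pip list | grep -i ctcdecoder
git -C DeepSpeech describe --tags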

We can’t provide help without more context on what exactly you are doing.


I have managed to train with Common Voice. The problem occurs even there: an early stop after 9 epochs.

Loss evolution:

dev    train
-----  -----
72.04  65.58
67.19  57.35
65.88  52.53
63.44  49.45
62.53  47.12
62.56  45.49
62.01  44.04
61.13  42.57
60.86  41.75
61.29  41.19

I had an error in the test data, so I’ll have to wait for test WER a bit.
I am also a bit baffled that the early stop occurred after just a single increase in dev loss.
[UPDATE]: Test WER: 0.606421

For German and an input of 1000 hours you should get better results. So you might have bad, incongruent data, or somehow your alphabet doesn’t represent Czech properly?

Maybe there is a good Czech dataset of 100 or so hours that you could use for training?

As for the early stop, that’s OK for your data; I wouldn’t expect miracles with more epochs, but you can change the behaviour with the early-stop parameters if you want to.

My last post is about Common Voice English with the provided language model. Something must be wrong with my setup.

I haven’t run that, but I would guess you get down to about WER 0.15 in about 15 epochs without early stop.

I would suggest you start over with a clean setup; usually I then find something I shouldn’t have done :slight_smile: What parameters did you call DeepSpeech with for that result?

./DeepSpeech.py \
    --alphabet_config_path "$ASRH/res/alphabet.txt" \
    --checkpoint_dir "$ASRH/temp/checkpoints" \
    --checkpoint_secs 900 \
    --dev_batch_size 80 \
    --dev_files "$ASRH/dev.csv" \
    --export_dir "$ASRH/model" \
    --lm_binary_path "$ASRH/data/lm/lm.binary" \
    --lm_trie_path "$ASRH/data/lm/trie" \
    --max_to_keep 3 \
    --summary_dir "$ASRH/temp/summaries" \
    --test_batch_size 80 \
    --test_files "$ASRH/test.csv" \
    --train_batch_size 100 \
    --train_files "$ASRH/train.csv"

You specify no CuDNN, learning rate or dropout. I would use a much higher dropout, like 0.25, and results should improve dramatically; the standard learning rate should be fine, and CuDNN will mainly speed things up.
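Concretely, something like this (the values are only starting points, and the flag names are as I remember them for 0.6):

# The same call as above, shortened; keep the remaining flags (LM paths,
# batch sizes, export/summary dirs) exactly as in the earlier invocation.
./DeepSpeech.py \
    --dropout_rate 0.25 \
    --learning_rate 0.0001 \
    --use_cudnn_rnn \
    --alphabet_config_path "$ASRH/res/alphabet.txt" \
    --checkpoint_dir "$ASRH/temp/checkpoints" \
    --train_files "$ASRH/train.csv" \
    --dev_files "$ASRH/dev.csv" \
    --test_files "$ASRH/test.csv"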

What do you mean I give no CuDNN? I have configured CuDNN usage via environment variables. Do I have to specify some DeepSpeech options for proper CUDA use?

I think it is use_cudnn_rnn if I remember correctly, but that’s mainly for speed, not WER.

I was not giving any such option and the GPUs were clearly used, judging 1) from the speed and 2) from the fact that the number of steps halved when I used two GPUs instead of one. Now that I have tried the --use_cudnn_rnn option, I get the error No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams'. Is this something I should dig into, or is it fine to just continue without the option?

The dropout should give you much better results. You might even try values like 0.4 in a later run, depending on the dataset.

@lissyx should know whether --use_cudnn_rnn would give you much better results if you are already training on GPU. I haven’t trained without it for some time.