How to train a model on the Common Voice dataset using DeepSpeech v0.6.1?

I'm following this link https://deepspeech.readthedocs.io/en/v0.6.1/TRAINING.html#continuing-training-from-a-release-model to train a model on the Common Voice dataset.
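
For reference, the recipe on that page boils down to something like the sketch below. The checkpoint archive name is the one shown on the GitHub releases page and the paths are placeholders, so treat this as a rough outline rather than my exact invocation:

# Fetch and unpack the v0.6.1 release checkpoint
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/deepspeech-0.6.1-checkpoint.tar.gz
tar xvf deepspeech-0.6.1-checkpoint.tar.gz

# Continue training from it; --n_hidden must match the release model (2048)
python3 DeepSpeech.py --n_hidden 2048 \
    --checkpoint_dir deepspeech-0.6.1-checkpoint/ \
    --train_files clips/train.csv \
    --dev_files clips/dev.csv \
    --test_files clips/test.csv \
    --epochs 3 \
    --learning_rate 0.0001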

To be specific: I am getting a validation loss that is always greater than the training loss, in every scenario.

That does not tell us:

  • all your command line parameters
  • which common voice dataset you are using exactly

That’s quite vague, and we don’t have a training log to use as a reference …

I'm using the Common Voice dataset for the English language, and the parameters I'm using are:
--n_hidden 2048 --learning_rate 0.000001 --dropout_rate 0.5

We are still lacking:

  • version of common voice you are training with
  • version of deepspeech checkpoints you are training with
  • training log

Can you please share all the information at once, instead of having me ask again and again?

OK. So I'm using DeepSpeech v0.6.1, and I downloaded the checkpoints for the same version. The Common Voice dataset was downloaded from https://voice.mozilla.org/en/datasets.
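
For completeness, my preparation looked roughly like this. The importer script is the one shipped in the DeepSpeech repo, and my exact paths may differ, so read it as a sketch:

# Unpack the Common Voice archive and import it
# (converts the MP3 clips to WAV and writes train/dev/test CSVs under clips/)
tar xvzf en.tar.gz -C /home/user/en
python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt /home/user/en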
I'm getting the result below.

Test on /home/user/en/clips/test.csv - WER: 0.422665, CER: 0.253988, loss: 44.272888
--------------------------------------------------------------------------------
WER: 6.000000, CER: 3.222222, loss: 193.430511
 - wav: file:///home/user/en/clips/common_voice_en_54384.wav
 - src: "undefined"
 - res: "everything on her and he banterer "
--------------------------------------------------------------------------------
WER: 3.750000, CER: 3.882353, loss: 363.831543
 - wav: file:///home/user/en/clips/common_voice_en_17645060.wav
 - src: "did you know that"
 - res: "the two road the du know that did you know that they do know that did you know that"
--------------------------------------------------------------------------------
WER: 2.666667, CER: 0.655172, loss: 120.730492
 - wav: file:///home/user/en/clips/common_voice_en_125325.wav
 - src: "elizabeth reclined gracefully"
 - res: "it is a bet to an integrate full"
--------------------------------------------------------------------------------
WER: 2.285714, CER: 1.928571, loss: 343.015198
 - wav: file:///home/user/en/clips/common_voice_en_17832183.wav
 - src: "as you sow so shall you reap"
 - res: "i just she didn't fall it all over myself i just sit in front at all"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 1.000000, loss: 20.358313
 - wav: file:///home/user/en/clips/common_voice_en_191353.wav
 - src: "amen"
 - res: "the men"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.600000, loss: 21.027395
 - wav: file:///home/user/en/clips/common_voice_en_18442278.wav
 - src: "behave yourself"
 - res: "the head or self"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.785714, loss: 33.058380
 - wav: file:///home/user/en/clips/common_voice_en_17267925.wav
 - src: "any volunteers"
 - res: "in a woman to"
--------------------------------------------------------------------------------
WER: 1.833333, CER: 1.285714, loss: 141.307816
 - wav: file:///home/user/en/clips/common_voice_en_680693.wav
 - src: "find me the saga air cavalry"
 - res: "fin made the aga i will cover for time the saga a cabal"
--------------------------------------------------------------------------------
WER: 1.666667, CER: 0.600000, loss: 42.782768
 - wav: file:///home/user/en/clips/common_voice_en_18429519.wav
 - src: "ideas are uncopyrightable"
 - res: "idea for an operator well"
--------------------------------------------------------------------------------
WER: 1.666667, CER: 0.666667, loss: 45.606365
 - wav: file:///home/user/en/clips/common_voice_en_2421.wav
 - src: "programming requires brains"
 - res: "so came i guess in"
--------------------------------------------------------------------------------
I Exporting the model...

Now, my query is: even though I have followed all the steps mentioned in the documentation for training a model on the Common Voice dataset, I am not able to get a good accuracy result. I just want to know whether there is anything wrong with my approach.
Thanks!

Please, we have had multiple releases of Common Voice, so this is not helping here.

The first problem is that you keep not sharing as many details as you should, and the information is spread all over the thread. I have to go back and forth just to get a picture of what you are doing. This is not helping at all.

Now, you report a 42% WER on Common Voice. This is mostly on par with what we have and with what other tools achieve on that dataset.

So please explain what exactly you mean. If that’s the WER of the test set, then I don’t see anything we can really improve as of now.

Not knowing your exact Common Voice version is a huge problem here because, as far as I remember, we have some of its data in the 0.6.1 model, so you might just be overfitting on it.
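
As a related sanity check, you can at least verify that your own train and test CSVs do not share transcripts. This is only illustrative: it compares transcripts, not audio, and assumes the transcript is the third CSV column with no embedded commas:

# Hypothetical overlap check between the importer's train and test CSVs
cut -d, -f3 /home/user/en/clips/train.csv | sort > train_sentences.txt
cut -d, -f3 /home/user/en/clips/test.csv | sort > test_sentences.txt
comm -12 train_sentences.txt test_sentences.txt | head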

How do I find out the Common Voice dataset version? I have downloaded it from https://voice.mozilla.org/en/datasets.
The other details are:
CUDA 10.2 and TensorFlow 1.14. I have 2 Tesla K80 GPUs with 24 GiB.

The download has a release name. When did you download it?

I downloaded the Common Voice data using sudo wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-3/en.tar.gz

So I think the Common Voice version is 3.

Right, but you still have not explained exactly what you meant by “Im not able to get good accuracy result”.

I meant to say that whenever I train a model, the validation loss is always greater than the training loss, and the test results are not very accurate.

Please be clear here; I’ve already answered that point.

Well, you have not shared any training log, so again, it is hard to help you there … How much do they diverge? Is it constant? Increasing? Decreasing? Have you tried other learning rates? Other dropout values?
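
If you have not, a small sweep is cheap to script. Something like the sketch below (paths and values are just an example; copy the release checkpoint into each directory first if you are fine-tuning rather than training from scratch):

# Illustrative hyper-parameter sweep; each run gets its own checkpoint directory
for lr in 0.0001 0.00001 0.000001; do
  for dr in 0.2 0.3 0.4; do
    python3 DeepSpeech.py --n_hidden 2048 \
        --learning_rate "$lr" --dropout_rate "$dr" \
        --checkpoint_dir "ckpt_lr${lr}_dr${dr}/" \
        --train_files /home/user/en/clips/train.csv \
        --dev_files /home/user/en/clips/dev.csv \
        --test_files /home/user/en/clips/test.csv \
        --epochs 3
  done
done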

I have this training log, with dropout rate 0.2 and learning rate 0.000001:
Epoch 0
Training loss 20.814306
validation loss 37.094707

Epoch 1
Training loss 20.219907
validation loss 36.780804

Epoch 2
Training loss 19.969551
validation loss 36.636615

Epoch 3
Training loss 19.783321
validation loss 36.561840

I don’t see anything suspicious in that.

OK. So what could be the reason that the testing results are not correct? I am referring to the same test output I posted above.

Have you read my previous replies? You have 42% WER on the test set of Common Voice; that is within the range reported in the literature.

As documented in the flags, the test report shows the top worst examples … Why are they that bad? I don’t know; maybe those are broken transcripts? Note that one of your samples has “undefined” as its source transcript, which looks exactly like a broken transcript.
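
If I remember correctly, the number of samples printed in that report is controlled by the --report_count flag (check util/flags.py in your checkout for the exact name and default), e.g.:

# Show more of the worst test samples in the report (flag name as I recall it in 0.6.x)
python3 DeepSpeech.py ... --test_files /home/user/en/clips/test.csv --report_count 20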

OK, thanks, I am getting somewhere now. I have one more query, regarding overfitting on the Common Voice dataset using the command below:

nohup ./DeepSpeech.py --n_hidden 2048 --learning_rate 0.0001 --checkpoint_dir /home/user/deepspeech-0.6.1-checkpoint/ --train_files /home/user/en/clips/train.csv --dev_files /home/user/en/clips/dev.csv --test_files /home/user/en/clips/test.csv --export_dir /home/user/modelexport  --use_cudnn_rnn --train_batch_size 100 --test_batch_size 100 --dev_batch_size 100 &

And the log is,
epoch starting time 2020-02-03 15:28:04.162653
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 1 | Loss: 8.183294
Epoch 0 |   Training | Elapsed Time: 0:00:05 | Steps: 2 | Loss: 8.716593
Epoch 0 |   Training | Elapsed Time: 0:00:08 | Steps: 3 | Loss: 8.281446
Epoch 0 |   Training | Elapsed Time: 0:00:10 | Steps: 4 | Loss: 8.427862
Epoch 0 |   Training | Elapsed Time: 0:00:12 | Steps: 5 | Loss: 8.701849
Epoch 0 |   Training | Elapsed Time: 0:00:15 | Steps: 6 | Loss: 8.790705
Epoch 0 |   Training | Elapsed Time: 0:00:17 | Steps: 7 | Loss: 8.693750
.
.
.
Epoch 0 |   Training | Elapsed Time: 0:32:18 | Steps: 308 | Loss: 19.248939
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /home/user/en/clips/dev.csv
Epoch 0 | Validation | Elapsed Time: 0:00:03 | Steps: 1 | Loss: 24.301804 | Dataset: /home/user/en/clips/dev.csv
Epoch 0 | Validation | Elapsed Time: 0:00:06 | Steps: 2 | Loss: 27.545616 | Dataset: /home/user/en/clips/dev.csv
Epoch 0 | Validation | Elapsed Time: 0:00:08 | Steps: 3 | Loss: 26.931159 | Dataset: /home/user/en/clips/dev.csv
I Saved new best validating model with loss 38.864146 to: /home/user/deepspeech-0.6.1-checkpoint/best_dev-234092
epoch ending time 2020-02-03 16:05:47.882186
Total epoch time 0:37:43.719533
epoch starting time 2020-02-03 16:05:47.882206
Epoch 1 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 1 |   Training | Elapsed Time: 0:00:04 | Steps: 1 | Loss: 9.052685
.
.
.
Epoch 1 | Validation | Elapsed Time: 0:05:20 | Steps: 63 | Loss: 39.579556 | Dataset: /home/user/en/clips/dev.csv
Epoch 1 | Validation | Elapsed Time: 0:05:20 | Steps: 63 | Loss: 39.579556 | Dataset: /home/user/en/clips/dev.csv

Like this, up to epoch 33, the training loss is always decreasing but the validation loss is always increasing.

That’s textbook overfitting. You need to do your own homework and adjust training hyper-parameters.
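
For what it is worth, the usual knobs for that are a lower learning rate, a higher dropout rate, and stopping earlier. Schematically, and only as a starting point, not a recipe (DeepSpeech also ships early-stopping flags; check util/flags.py for their exact names in 0.6.1):

# Sketch of a more conservative fine-tuning run
python3 DeepSpeech.py --n_hidden 2048 \
    --checkpoint_dir /home/user/deepspeech-0.6.1-checkpoint/ \
    --train_files /home/user/en/clips/train.csv \
    --dev_files /home/user/en/clips/dev.csv \
    --test_files /home/user/en/clips/test.csv \
    --learning_rate 0.00001 \
    --dropout_rate 0.3 \
    --epochs 5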