Has anyone gotten a good result when training on the Common Voice dataset?

I used the command:

    ./DeepSpeech.py --train_files ./data/CV/cv-valid-train.csv \
                    --dev_files ./data/CV/cv-valid-dev.csv \
                    --test_files ./data/CV/cv-valid-test.csv \
                    --checkpoint_dir ../checkPoint512-45/ \
                    --export_dir ./export_model/ \
                    --n_hidden 512 --epoch 30 --validation_step 2 \
                    --train_batch_size 45 --dev_batch_size 15 --test_batch_size 5

i.e. 512 hidden units, with only the cv-valid-train.csv file as the training set.
Training stopped early at epoch 17, but the result is not very good:
WER: 0.408375, loss: 33.786406912, mean edit distance: 0.223059

Has anyone gotten a better result? Could you give me some advice on tuning the parameters?

Sorry @jackhuang, I don’t know, but I’d also be interested. The question I came here to ask: are there any benchmark results for DeepSpeech on Common Voice?

@jackhuang did you ever get your model trained on that single GPU? How long did it take to train in the end? How many epochs?

It seems the Common Voice test set is difficult to recognize, and I haven’t found any benchmark results yet.

Yes. Each epoch over the ~800 hours of data takes roughly 6 hours to train. The number of epochs needed is hard to predict; it depends on the data and the parameters.
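
As a rough sanity check (assuming that ~6 h/epoch figure holds), the 17 epochs from my first post work out to a bit over 4 days of wall-clock time:

    # 17 epochs at ~6 hours each, expressed in days
    echo "scale=1; 17 * 6 / 24" | bc    # prints 4.2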

Thanks @jackhuang, very helpful. I have DeepSpeech training right now (on a single-GPU AWS instance), so we’ll see how that goes, and I’ll report back here on the accuracy and the number of epochs required to reach it.

Do you think the Common Voice dataset is hard to train on because the recording quality is so variable?

Thanks!!

Some benchmarks: https://github.com/Franck-Dernoncourt/ASR_benchmark#benchmark-results

Are those results on the test set?

The best I’ve reached so far is a loss of 18.341875 on the training set, with the command:

    CUDA_VISIBLE_DEVICES=1,2,3 unbuffer python DeepSpeech.py \
        --train_files data/common-voice-v1/cv-valid-train.csv \
        --dev_files data/common-voice-v1/cv-valid-train.csv \
        --test_files data/common-voice-v1/cv-valid-train.csv \
        --log_level 0 --train_batch_size 20 --train True \
        --decoder_library_path ./libctc_decoder_with_kenlm.so \
        --checkpoint_dir cv007 --export_dir cv007export \
        --summary_dir cv007summaries --summary_secs 600 \
        --wer_log_pattern "GLOBAL LOG: logwer('${COMPUTE_ID}', '%s', '%s', %f)" \
        --learning_rate 0.0001 |& tee -a cv007.log

Loss on training set for each epoch:

    Line 12969:  I Training of Epoch 0 - loss: inf
    Line 25854:  I Training of Epoch 1 - loss: 46.667273
    Line 38739:  I Training of Epoch 2 - loss: 33.400887
    Line 51624:  I Training of Epoch 3 - loss: 25.859372
    Line 64511:  I Training of Epoch 4 - loss: inf
    Line 77396:  I Training of Epoch 5 - loss: 18.341875
    Line 95289:  I Training of Epoch 6 - loss: inf
    Line 113487: I Training of Epoch 7 - loss: inf
    Line 130967: I Training of Epoch 8 - loss: inf
    Line 149278: I Training of Epoch 9 - loss: inf
    Line 168374: I Training of Epoch 10 - loss: inf
    Line 192899: I Training of Epoch 11 - loss: inf
    Line 216609: I Training of Epoch 12 - loss: inf
    Line 246423: I Training of Epoch 13 - loss: inf
    Line 271818: I Training of Epoch 14 - loss: inf
    Line 301303: I Training of Epoch 15 - loss: inf
    Line 333113: I Training of Epoch 16 - loss: inf
    Line 366976: I Training of Epoch 17 - loss: inf

Not great, I’m also looking for some better hyperparameters.
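
Incidentally, those per-epoch lines can be pulled straight out of the log written by the `tee -a cv007.log` at the end of the command above, e.g.:

    # extract the per-epoch loss lines (with their line numbers) from the log
    grep -n "Training of Epoch" cv007.log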


Have you tried increasing n_hidden? The value in the original post seems too small; the default for the network in the repo is 2048. An example follows below.
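
For instance, the first post’s command with a wider network might look like this (untested; the checkpoint directory name is just a placeholder, and the batch sizes are guesses scaled down to fit the larger model in GPU memory):

    ./DeepSpeech.py --train_files ./data/CV/cv-valid-train.csv \
                    --dev_files ./data/CV/cv-valid-dev.csv \
                    --test_files ./data/CV/cv-valid-test.csv \
                    --checkpoint_dir ../checkPoint2048/ \
                    --export_dir ./export_model/ \
                    --n_hidden 2048 --epoch 30 --validation_step 2 \
                    --train_batch_size 16 --dev_batch_size 8 --test_batch_size 4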

The language model should also help improve the RNN’s predictions. I believe the beam search keeps the 1024 most likely candidate transcripts at each step (the default beam width) and scores them by the probability of the whole word sequence being correct. There are also some reasonable default parameters in place that weight the RNN’s CTC output against the language-model score during the beam search.
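
Concretely, for the DeepSpeech version used in the commands above, hooking in the KenLM decoder looks something like this (a sketch, not tuned: the lm.binary and trie paths assume the files shipped in the repo’s data/lm/, and 1024 is the default beam width as far as I know):

    ./DeepSpeech.py --train_files ./data/CV/cv-valid-train.csv \
                    --dev_files ./data/CV/cv-valid-dev.csv \
                    --test_files ./data/CV/cv-valid-test.csv \
                    --checkpoint_dir ../checkPoint512-45/ \
                    --n_hidden 512 --epoch 30 --validation_step 2 \
                    --train_batch_size 45 --dev_batch_size 15 --test_batch_size 5 \
                    --decoder_library_path ./libctc_decoder_with_kenlm.so \
                    --alphabet_config_path data/alphabet.txt \
                    --lm_binary_path data/lm/lm.binary \
                    --lm_trie_path data/lm/trie \
                    --beam_width 1024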