Fine-tuning DeepSpeech Model (CommonVoice-DATA)

Hi, I am trying to produce a model that can properly transcribe long files (e.g. 10 minutes of BBC news). At first I thought of training my own model on the latest CommonVoice release, but I realized that it demands many hours of training… So I decided to fine-tune the existing model (v0.5.1), which has already been trained on CommonVoice data.
I am using the same hyperparameters as reported in the latest release. However, I am getting worse results (WER ~80%) from the first model exported after early stopping, while the Mozilla release model achieves WER 55%. Let me mention that I am using the streaming API (/examples/ffmpeg_vad_streaming) to decode the long files.
My questions are: could I ever get a better-trained model? Is it reasonable to get worse performance after fine-tuning with the same hyperparameters? Has anybody tried something similar?
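
For reference on the WER numbers discussed here: WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal Python sketch for sanity-checking transcripts yourself (the example strings are made up):

  # WER = word-level Levenshtein distance / number of reference words
  def wer(reference: str, hypothesis: str) -> float:
      ref, hyp = reference.split(), hypothesis.split()
      # dp[i][j] = edit distance between ref[:i] and hyp[:j]
      dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
      for i in range(len(ref) + 1):
          dp[i][0] = i
      for j in range(len(hyp) + 1):
          dp[0][j] = j
      for i in range(1, len(ref) + 1):
          for j in range(1, len(hyp) + 1):
              sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
              dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
      return dp[len(ref)][len(hyp)] / len(ref)

  # one substitution + one insertion over a 4-word reference -> 0.5
  print(wer("the quick brown fox", "the quack brown fox jumps"))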


Please make sure you reproduce with the C++ client first; it’s possible this example does extra processing that interferes.

With the C++ client, do you mean the /DeepSpeech/native_client/deepspeech --model... --alphabet ... --lm ... --trie ... command? When I run this command the results are worse too (compared to the Mozilla release model). I’ve been searching for the streaming API, and I was told in a previous post that I should write code. Is there any example for English?
Also, training is still running, but now I see inf train/dev loss (1st & 2nd epochs) and I wonder whether that is normal…
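
For reference, a minimal sketch of streaming inference through the Python bindings, assuming the v0.5.1 API (setupStream/feedAudioContent/finishStream) and a 16 kHz mono 16-bit WAV file; all paths are placeholders, and native_client/python/client.py in the repo is the authoritative example:

  # Streaming sketch -- assumes the v0.5.1 Python bindings; paths are placeholders.
  import wave
  import numpy as np
  from deepspeech import Model

  N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500  # v0.5.1 release defaults
  LM_ALPHA, LM_BETA = 0.75, 1.85

  ds = Model('output_graph.pbmm', N_FEATURES, N_CONTEXT, 'alphabet.txt', BEAM_WIDTH)
  ds.enableDecoderWithLM('alphabet.txt', 'lm.binary', 'trie', LM_ALPHA, LM_BETA)

  wav = wave.open('audio_16k_mono.wav', 'rb')
  sctx = ds.setupStream()  # 16 kHz sample rate by default in v0.5.1
  while True:
      chunk = wav.readframes(1024)  # feed the long file in small pieces
      if not chunk:
          break
      ds.feedAudioContent(sctx, np.frombuffer(chunk, np.int16))
  wav.close()
  print(ds.finishStream(sctx))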

About the VAD example: I have noticed that the inference is pretty much the same. It may use some extra parameters, but I don’t think it can affect the results to any large degree. I think something is probably going wrong with the fine-tuning, or maybe I have to spend more hours training.

There are examples; I’m not really sure I understand your point here.

Yes, well, at some point, without more context on how you train, it’s hard …

  python3 DeepSpeech.py \
  --train_files /home/christina/PycharmProjects/DeepSpeech1/cv_model/train.csv \
  --dev_files /home/christina/PycharmProjects/DeepSpeech1/cv_model/dev.csv \
  --test_files /home/christina/PycharmProjects/DeepSpeech1/cv_model/test.csv \
  --train_batch_size 24 \
  --dev_batch_size 48 \
  --test_batch_size 48 \
  --dropout_rate 0.15 \
  --epochs 30 \
  --validation_step 1 \
  --report_count 20 \
  --early_stop True \
  --learning_rate 0.0001 \
  --export_dir /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/checkpoints \
  --checkpoint_dir /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/checkpoints \
  --alphabet_config_path /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/alphabet.txt \
  --lm_binary_path /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/lm.binary \
  --lm_trie_path /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/trie

I have already mentioned that I used the same hyperparameters as the released 0.5.1 model, together with its checkpoints. I train on a GPU (GeForce RTX, 10989 MiB) with the CommonVoice validated data. Let me know if you need more information.

You did, but you also mentioned fine-tuning without any information about which dataset you are using to do that.

In your command line, what are train.csv, test.csv and dev.csv? The Common Voice dataset?

Yes, all my .csv files contain CommonVoice data (from the latest release: 774 validated hours).

Okay, so are you aware that this is not good practice? Your model will have already learnt some of that data.

If I understand your question correctly, you mean that the last released model was trained on the Common Voice dataset, so when I fine-tune with the same data I get worse results?

Yes, it was documented in the v0.4 release notes; it looks like it’s not in the latest ones, though.

Ok, never mind; it turns out that in this release we did not train on Common Voice after all.

Try lowering the learning rate further.

Ok, I will try lr = 0.00005 and report my results. Thank you for the quick responses!


The problem remains… Through the 4th epoch everything seems normal (train & val loss ≈ 25 and decreasing gradually). However, at the end of the 4th epoch the train loss suddenly becomes inf, and the training process then stops via early stopping with val_loss = inf. I tried to fine-tune again from the most recent checkpoints, but the loss is stuck at inf. Does anyone have any idea what is going wrong?

  python3 DeepSpeech.py \
  --train_files /home/christina/PycharmProjects/DeepSpeech1/cv_model/train.csv \
  --dev_files /home/christina/PycharmProjects/DeepSpeech1/cv_model/dev.csv \
  --test_files /home/christina/PycharmProjects/DeepSpeech1/cv_model/test.csv \
  --train_batch_size 64 \
  --dev_batch_size 64 \
  --test_batch_size 64 \
  --dropout_rate 0.15 \
  --epochs 30 \
  --validation_step 1 \
  --report_count 20 \
  --early_stop True \
  --learning_rate 0.00005 \
  --export_dir /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/checkpoints \
  --checkpoint_dir /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/checkpoints \
  --alphabet_config_path /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/alphabet.txt \
  --lm_binary_path /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/lm.binary \
  --lm_trie_path /home/christina/PycharmProjects/DeepSpeech1/mozilla-models/trie

I have also tried lr = 0.00001 / 0.001 / 0.01 and a lower batch size, but the inf loss problem is still there…

In some cases, people have had more success with a learning rate of 1e-6.

This is usually a sign that the learning rate is too high. Try lowering it by a factor of 10, say.

Indeed, lr = 1e-6 solved the problem! I haven’t seen an inf loss since I lowered the value. Thank you!

@ctzogka were you able to complete the fine-tuning of the 0.5.1 pre-trained model with CommonVoice data? How do the results look with respect to WER? Is there a specific accent you are trying to target with the new model?