Fine-tuning DeepSpeech Model (CommonVoice-DATA)

Hi @sranjeet.visteon!
I haven’t completed fine-tuning yet… Let me explain my plan:
I am trying to build a model that supports decoding of long files such as BBC news clips (10 minutes). These clips contain noise (e.g. phone conversations), varied accents (e.g. Hindi, African, etc.) and sometimes intermediate music sections. So I have been using CV data to fine-tune 0.5.1 (whose training does not include CV) in order to boost accuracy. Initially WER = 58%, and now WER = 51% on these clips. On the CV test set there is also a significant improvement: from WER = 44% to WER = 22%!

Hi @ctzogka

Did you use the parameters from your previous posts (epochs 30, lr 1e-6, dropout 0.15, …) to get those results?

Interesting use case you have here, do you have a git repo where I can follow your progress? I have something similar in mind :slight_smile:

Thanks!

Hello @caucheteux!
I used all the parameters I referred to before. However, I am currently fine-tuning with lr = 5e-1 / 95e-1. I noticed that WER increased for my BBC-news test clip when I set lr = 95e-1. On the contrary, WER decreased on the CV test set for the same value.
I don’t have a git repo, but I will gradually report my results here for your reference.


My advice for anyone doing something similar is to experiment with lr values in order to find the appropriate one.
"A large learning rate allows the model to learn faster, at the cost of arriving on a sub-optimal final set of weights. A smaller learning rate may allow the model to learn a more optimal or even globally optimal set of weights but may take significantly longer to train."
It’s safe to start from 1e-6 and keep increasing as long as results improve. Always check the train & val loss; if they are close, you are doing a good job (no over/under-fitting)!
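
A minimal sketch of such a sweep, assuming you relaunch DeepSpeech.py per candidate lr and parse the overall WER from its test report (a rough, untested illustration, not the project’s own tooling; in practice each run would also want its own --checkpoint_dir so runs don’t resume from each other):

import re
import subprocess

def run_finetune(lr):
    # Hypothetical helper: one fine-tuning run with the flags used in this
    # thread, returning the overall test WER parsed from the report line.
    cmd = [
        "python", "-u", "DeepSpeech.py",
        "--train_files", "../data/en/clips/train.csv",
        "--dev_files", "../data/en/clips/dev.csv",
        "--test_files", "../data/en/clips/test.csv",
        "--n_hidden", "2048",
        "--learning_rate", str(lr),
        "--dropout_rate", "0.15",
        "--epochs", "5",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    match = re.search(r"WER: (\d+\.\d+)", out)
    return float(match.group(1)) if match else None

for lr in (1e-6, 5e-6, 1e-5, 5e-5):
    print(lr, run_finetune(lr))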


Hi @lissyx,

I have a question. As the model from v0.5.1 is not trained on the CV dataset, is it possible that some sentences are not in the provided LM?
I found some absurdities such as “he the man” instead of “amen”; is it an LM problem or just a model problem?

@ctzogka
After 20 epochs of fine-tuning (lr = 1e-6, dropout rate = 0.15, drop_source_layers = 1), trained on the CV dataset, I got 48% WER, the same as the original model. Am I missing something somewhere? Note that I had to force no early stop. But no overfitting (train_loss ~38.0, dev_loss ~41).

Sorry if it’s not clear :slight_smile:
Thanks!

No, the LM is built from Wikipedia, not from CV.

Then it is a model-related problem, thanks!
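
One way to sanity-check whether the LM itself favours an odd hypothesis like this is to score both strings against the shipped lm.binary with the kenlm Python package. A small sketch (just a rough diagnostic; the actual decoder also weights acoustic scores and word insertion penalties):

import kenlm

# Log10 LM scores; higher (less negative) means the LM considers the
# word sequence more likely.
lm = kenlm.Model("../deepspeech_models_0.5.1/lm.binary")
for sentence in ("amen", "he the man"):
    print(sentence, lm.score(sentence, bos=True, eos=True))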

Hi @lissyx

I have this kind of output after one training run (LR 1e-5, dropout 0.15, fine-tuning, 9 epochs, drop_source_layers 1, train & test on CV, WER 46.78%):

  • src: “common voice”
  • res: “kind common voice slashed title”

I find this weird, as it recognizes “common voice” but puts it inside a longer sentence. My first hunch would be to “blame” the LM… Am I guessing wrong?

Maybe 9 epochs is a bit short. What’s your loss?

I Saved new best validating model with loss 39.072543 to: ../deepspeech-checkpoint_TL2/best_dev-483448
I Early stop triggered as (for last 4 steps) validation loss: 39.072543 with standard deviation: 0.185010 and mean: 39.483036

Here is the output after train and dev; as you can see, it’s the early stop that ended training at 9 epochs. Do I have to force it to go on to 15-20 epochs?
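
For what it’s worth, the message above reports the mean and standard deviation of the last few validation losses, which matches the --es_steps / --es_std_th / --es_mean_th flags. A rough illustration of that kind of plateau check (not the exact rule in DeepSpeech.py, just the general idea):

from statistics import mean, stdev

def looks_plateaued(val_losses, es_steps=4, es_std_th=0.1, es_mean_th=0.1):
    # Consider only the most recent validation losses.
    window = val_losses[-es_steps:]
    if len(window) < es_steps:
        return False
    # Flat window (small spread) and the latest loss no longer clearly
    # below the window mean -> little point in continuing.
    return stdev(window) < es_std_th and abs(window[-1] - mean(window)) < es_mean_th

print(looks_plateaued([41.2, 40.1, 39.6, 39.55, 39.52, 39.50]))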

I got some very weird output after testing; below are the worst WER samples I got:

Test on ../data/en/clips/test.csv - WER: 0.467797, CER: 0.296113, loss: 45.341286
--------------------------------------------------------------------------------
WER: 3.500000, CER: 63.000000, loss: 258.041626
 - src: "did you know that"
 - res: "the tune that they know that to you now that they know that the you know that"
--------------------------------------------------------------------------------
WER: 3.000000, CER: 9.000000, loss: 20.913012
 - src: "amen"
 - res: "yes the man"
--------------------------------------------------------------------------------
WER: 3.000000, CER: 11.000000, loss: 29.868282
 - src: "kettledrums"
 - res: "cats will draw"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 4.000000, loss: 13.566093
 - src: "nosiree"
 - res: "no there"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 10.000000, loss: 23.384727
 - src: "kettledrums"
 - res: "i ran"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 11.000000, loss: 38.728931
 - src: "medley hotchpotch"
 - res: "may i eat pots"
--------------------------------------------------------------------------------
WER: 1.750000, CER: 16.000000, loss: 66.706467
 - src: "that's an inherent disadvantage"
 - res: "the them and earn as a vantage"
--------------------------------------------------------------------------------
WER: 1.750000, CER: 21.000000, loss: 79.642654
 - src: "undefined reference to 'swizzle'"
 - res: "and i find you are into sir"
--------------------------------------------------------------------------------
WER: 1.714286, CER: 43.000000, loss: 237.652634
 - src: "as you sow so shall you reap"
 - res: "i just sit in partial over myself i just sit in facial"
--------------------------------------------------------------------------------
WER: 1.666667, CER: 13.000000, loss: 39.023445
 - src: "vanilla cupcakes forever"
 - res: "the nile cut for ever"
--------------------------------------------------------------------------------
WER: 1.666667, CER: 16.000000, loss: 57.100410
 - src: "also inspector gadget"
 - res: "as he spoke he tore"
--------------------------------------------------------------------------------
WER: 1.666667, CER: 24.000000, loss: 72.677155
 - src: "elizabeth reclined gracefully"
 - res: "it is a basilica for"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 6.000000, loss: 13.087830
 - src: "itching palm"
 - res: "i in part"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 10.000000, loss: 25.016457
 - src: "that's admirable"
 - res: "that is my"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 10.000000, loss: 53.508671
 - src: "eta eleventhirty"
 - res: "looky eleven times"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 17.000000, loss: 61.997990
 - src: "christmas eve at midnight"
 - res: "i met you had been light"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 19.000000, loss: 98.405342
 - src: "common voice"
 - res: "kind common voice slashed title"
--------------------------------------------------------------------------------
WER: 1.428571, CER: 28.000000, loss: 85.970360
 - src: "the family that prays together stays together"
 - res: "only the trees to give a state to be in a"
--------------------------------------------------------------------------------
WER: 1.333333, CER: 2.000000, loss: 17.977991
 - src: "forewarned is forearmed"
 - res: "for warned is fore armed"
--------------------------------------------------------------------------------
WER: 1.333333, CER: 7.000000, loss: 20.886923
 - src: "aren't we godlike"
 - res: "and in god like"
--------------------------------------------------------------------------------
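A side note on the numbers above: the per-sample WER is word-level edit distance divided by the number of reference words, so a long hallucinated hypothesis can exceed 1.0 (3.5 against the four-word “did you know that”). Something like the jiwer package lets you check these per-sample figures, e.g.:

import jiwer

# WER = (substitutions + deletions + insertions) / reference word count,
# so many inserted words push it well above 1.0.
ref = "did you know that"
hyp = ("the tune that they know that to you now that "
       "they know that the you know that")
print(jiwer.wer(ref, hyp))
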

I believe this is the problem too. My test loss is around 3 and my train loss is under 1, but I get similar output to yours.

--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.250000, loss: 0.589752
 - src: "amen"
 - res: "men"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.100000, loss: 5.420970
 - src: "a steakandkidney pie"
 - res: "a steak and kidney pie"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.050000, loss: 5.953032
 - src: "topsyturvy steamboat"
 - res: "topsy turvy steamboat"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.583333, loss: 22.851200
 - src: "common voice"
 - res: "the common voice to"

I’ve yet to experiment with the alpha and beta parameters. Any insights?
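
One way to experiment with them is a small grid search over the decoder weights on a held-out set; a hedged sketch below, where transcribe() is a hypothetical placeholder for however you run inference with the LM enabled, and the value ranges are just guesses:

import itertools
import jiwer

def transcribe(wav_path, lm_alpha, lm_beta):
    # Placeholder: run your decoder of choice with these LM weights
    # and return the hypothesis string.
    return "placeholder transcript"

# Toy (wav, reference) pairs; in practice use a real dev set.
dev_set = [("clip1.wav", "common voice"), ("clip2.wav", "amen")]

best = None
for alpha, beta in itertools.product([0.5, 0.75, 1.0, 1.25], [1.0, 1.5, 1.85, 2.2]):
    refs = [ref for _, ref in dev_set]
    hyps = [transcribe(wav, alpha, beta) for wav, _ in dev_set]
    wer = jiwer.wer(refs, hyps)
    if best is None or wer < best[0]:
        best = (wer, alpha, beta)

print("best (WER, alpha, beta):", best)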

Hey @alchemi5t,

Weird that our losses are not the same; could you share your parameters? After multiple tries, it seems that I don’t get under a loss of 35~40. I don’t know why; I tried to force more epochs, but it began to overfit…
I’m missing something here, but I don’t know what…

I didn’t try to modify the alpha and beta parameters, as I thought the Mozilla team had already optimized them.

That’s not really possible. How would they know what kind of post-processing behaviour everyone would want?

The only thing I can think of right now that’s different for us is the dropout rate: mine is 25%, with 2048 n_hidden, and I modified the LR throughout the training. I don’t early stop. I also don’t let it overfit, which I didn’t see happening. Post your config (pre-format it), let me see if I can spot anything.

True

Here are the parameters from my last try:

python -u DeepSpeech.py \
--fine_tune True \
--export_dir .../MyModel_TL2/ \
--summary_dir ../summaries/TL2/ \
--checkpoint_dir ../deepspeech-checkpoint \
--train_files ../data/en/clips/train.csv \
--dev_files ../data/en/clips/dev.csv \
--test_files ../data/en/clips/test.csv \
--lm_binary_path ../deepspeech_models_0.5.1/lm.binary \
--lm_trie_path ../deepspeech_models_0.5.1/trie \
--alphabet_config_path ../deepspeech_models_0.5.1/alphabet.txt \
--load best \
--drop_source_layers 1 \
--source_model_checkpoint_dir ../deepspeech-0.5.1-checkpoint/ \
--n_hidden 2048 \
--early_stop False \
--train_batch_size 32 \
--dev_batch_size 32 \
--test_batch_size 32 \
--learning_rate 0.00005 \
--dropout_rate 0.15 \
--epochs 30 \
--validation_step 1 \
--report_count 20 \
--es_std_th 0.1 \
--es_mean_th 0.1 \
--es_steps 20

I train on CV and also test on CV. WER is 47.12%.

I notice that some parameters don’t seem to have any effect; did I write them wrong?

Those will have an impact on the training, and you will likely have to do your own search for the best fit with your data / goal.

I am not familiar with a few of your flags. Anyhow, like I said, the only obvious difference I see is the batch size (mine is 64); the rest I’ve already posted here. For training from scratch my LR is 0.0001; when you see overfitting, cut it down.

Why drop layers? If you are targeting the same language the model was trained on, you are basically “losing knowledge” from the previous training.

It looks like your model will overfit. I’m doing transfer learning with 500h, and even with 2 layers dropped, 1 epoch is enough; with more epochs the WER increased.

Try with

--epochs 1


Thanks @carlfm01 and @alchemi5t,

Seems like I got mixed up between TL to another language and TL for the same language.

I’ll try to clean up my parameters and see what is useful and what is not…

I don’t think the batch size has an impact on accuracy, but rather on training time.

This is not accurate. Check this out:

After one and a half years, I come back to my answer because my previous answer was wrong.
Batch size impacts learning significantly. What happens when you put a batch through your network is that you average the gradients. The concept is that if your batch size is big enough, this will provide a stable enough estimate of what the gradient of the full dataset would be. By taking samples from your dataset, you estimate the gradient while reducing computational cost significantly. The lower you go, the less accurate your estimate will be; however, in some cases these noisy gradients can actually help escape local minima. When it is too low, your network weights can just jump around if your data is noisy and it might be unable to learn, or it converges very slowly, thus negatively impacting total computation time.
Another advantage of batching is for GPU computation, GPUs are very good at parallelizing the calculations that happen in neural networks if part of the computation is the same (for example, repeated matrix multiplication over the same weight matrix of your network). This means that a batch size of 16 will take less than twice the amount of a batch size of 8.
In the case that you do need bigger batch sizes but it will not fit on your GPU, you can feed a small batch, save the gradient estimates and feed one or more batches, and then do a weight update. This way you get a more stable gradient because you increased your virtual batch size.
WRONG, OLD ANSWER: [[[No, the batch_size on average only influences the speed of your learning, not the quality of learning. The batch_sizes also don’t need to be powers of 2, although I understand that certain packages only allow powers of 2. You should try to get your batch_size the highest you can that still fits the memory of your GPU to get the maximum speed possible.]]]

credit:-

Theoretically, you’re only right when every sample conforms to a strict set of rules and the network is guaranteed to converge to that function, which is (sweepingly) never the case.

Correct me if I am wrong to believe this is accurate.
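
The “virtual batch size” trick from the quoted answer is just gradient accumulation; a minimal PyTorch sketch on a toy model (only to illustrate the pattern, not how DeepSpeech itself trains):

import torch
import torch.nn as nn

# Toy model and random data, only to show the accumulation pattern.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

accum_steps = 4      # 4 micro-batches of 8 ~ one "virtual" batch of 32
micro_batch = 8

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(micro_batch, 10)
    y = torch.randn(micro_batch, 1)
    loss = loss_fn(model(x), y) / accum_steps   # scale so gradients average
    loss.backward()                             # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                        # one update per virtual batch
        optimizer.zero_grad()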

That is interesting. I want to know how TL to another language behaves. Because if you don’t get stuck in any local minima (unlikely), you would still converge, except maybe faster/slower. Any observations?