Fine Tuning with Custom English Data (Very Small Size)

  • Mozilla STT version: DeepSpeech 0.9.3
  • OS: Google Colab
  • Python 3.7.10
  • TensorFlow 1.15.2
  • GPU with CUDA 10.0

I tried fine-tuning my model by downloading the v0.9.3 checkpoints and scorer and using the parameters in the command further below. I originally split the data in an 8:1:1 ratio (roughly as in the sketch below), but later reused the same file for both the dev and test sets, so effectively an 8:2 split.
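For reference, the split was done roughly like this (a minimal sketch with pandas; the source file name and the random seed are placeholders, not my exact script):

    import pandas as pd

    # Shuffle the full manifest, then cut it 80/10/10 into train/dev/test.
    df = pd.read_csv("/content/all_data.csv")
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    n_train = int(0.8 * len(df))
    n_dev = int(0.1 * len(df))

    df.iloc[:n_train].to_csv("/content/train.csv", index=False)
    df.iloc[n_train:n_train + n_dev].to_csv("/content/intermediate.csv", index=False)  # dev set (later also reused as test)
    df.iloc[n_train + n_dev:].to_csv("/content/test.csv", index=False)

The fine-tuning command: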

 !python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 3 --es_steps 5 --n_hidden 2048 --epochs 20 \
  --export_dir /content/models/  --checkpoint_dir /content/model_checkpoints/ \
  --train_files /content/train.csv --dev_files /content/intermediate.csv --test_files /content/intermediate.csv \
  --learning_rate 0.0001 --train_batch_size 24 --test_batch_size 48 --dev_batch_size 48 --export_file_name 'ft_model' \
  --augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
  --augment volume[p=0.2,dbfs=-10:-40] \
  --augment pitch[p=0.2,pitch=1~0.2] \
  --augment tempo[p=0.2,factor=1~0.5] 
This gave me the following results:
    > I0405 02:04:46.870591 140206192527232 utils.py:157] NumExpr defaulting to 2 threads.
    > I Could not find best validating checkpoint.
    > I Could not find most recent checkpoint.
    > I Initializing all variables.
    > I STARTING Optimization
    > Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 645.115417      
    > Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 195.178772 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 195.178772 to: /content/model_checkpoints/best_dev-1
    > --------------------------------------------------------------------------------
    > Epoch 1 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 208.354904      
    > Epoch 1 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 124.489716 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 124.489716 to: /content/model_checkpoints/best_dev-2
    > --------------------------------------------------------------------------------
    > Epoch 2 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 151.153809      
    > Epoch 2 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 149.881775 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 3 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 184.461914      
    > Epoch 3 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 138.397873 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 4 |   Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 169.295624      
    > Epoch 4 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 113.872894 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 113.872894 to: /content/model_checkpoints/best_dev-5
    > --------------------------------------------------------------------------------
    > Epoch 5 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 141.762100      
    > Epoch 5 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 111.355629 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 111.355629 to: /content/model_checkpoints/best_dev-6
    > --------------------------------------------------------------------------------
    > Epoch 6 |   Training | Elapsed Time: 0:00:03 | Steps: 1 | Loss: 124.472694      
    > Epoch 6 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 134.303055 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 7 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 144.363953      
    > Epoch 7 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 124.479218 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 8 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 131.589767      
    > Epoch 8 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 109.516235 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 109.516235 to: /content/model_checkpoints/best_dev-9
    > --------------------------------------------------------------------------------
    > Epoch 9 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 122.574684      
    > Epoch 9 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 104.244781 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 104.244781 to: /content/model_checkpoints/best_dev-10
    > --------------------------------------------------------------------------------
    > Epoch 10 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 120.229431     
    > Epoch 10 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 105.151703 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 11 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 124.279762     
    > Epoch 11 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 106.004166 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 12 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 123.945976     
    > Epoch 12 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 105.062477 | Dataset: /content/intermediate.csv
    > I Early stop triggered as the loss did not improve the last 3 epochs
    > I FINISHED optimization in 0:05:06.340667
    > I Loading best validating checkpoint from /content/model_checkpoints/best_dev-10
    > I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
    > I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
    > I Loading variable from checkpoint: global_step
    > I Loading variable from checkpoint: layer_1/bias
    > I Loading variable from checkpoint: layer_1/weights
    > I Loading variable from checkpoint: layer_2/bias
    > I Loading variable from checkpoint: layer_2/weights
    > I Loading variable from checkpoint: layer_3/bias
    > I Loading variable from checkpoint: layer_3/weights
    > I Loading variable from checkpoint: layer_5/bias
    > I Loading variable from checkpoint: layer_5/weights
    > I Loading variable from checkpoint: layer_6/bias
    > I Loading variable from checkpoint: layer_6/weights
    > Testing model on /content/intermediate.csv
    > Test epoch | Steps: 1 | Elapsed Time: 0:00:12                                   
    > Test on /content/intermediate.csv - WER: 1.000000, CER: 0.880259, loss: 104.244766
    > --------------------------------------------------------------------------------
    > Best WER: 
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.884615, loss: 154.878494
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/07.wav
    >  - src: "brain is the highest coordinating centre in the body"
    >  - res: "      "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.860000, loss: 153.049408
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/09.wav
    >  - src: "energy required by an organism comes from the food"
    >  - res: "       "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.869565, loss: 138.342255
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/25.wav
    >  - src: "enables the creation of cross platform program"
    >  - res: "e e   "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.875000, loss: 120.579895
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/04.wav
    >  - src: "upgrade changes in core system resources"
    >  - res: "      "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.916667, loss: 107.980042
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/06.wav
    >  - src: "dealloaction is completely automatic"
    >  - res: "     "
    > --------------------------------------------------------------------------------
    > Median WER: 
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.869565, loss: 138.342255
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/25.wav
    >  - src: "enables the creation of cross platform program"
    >  - res: "e e   "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.875000, loss: 120.579895
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/04.wav
    >  - src: "upgrade changes in core system resources"
    >  - res: "      "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.916667, loss: 107.980042
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/06.wav
    >  - src: "dealloaction is completely automatic"
    >  - res: "     "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.880000, loss: 76.253258
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/08.wav
    >  - src: "liver secretes bile juice"
    >  - res: "    "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.869565, loss: 68.422134
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/f521a5fd-3081-4c34-9c13-ed1e840925ea.wav
    >  - src: "pass me the salt bottle"
    >  - res: "   "
    > --------------------------------------------------------------------------------
    > Worst WER: 
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.916667, loss: 107.980042
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/06.wav
    >  - src: "dealloaction is completely automatic"
    >  - res: "     "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.880000, loss: 76.253258
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/08.wav
    >  - src: "liver secretes bile juice"
    >  - res: "    "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.869565, loss: 68.422134
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/f521a5fd-3081-4c34-9c13-ed1e840925ea.wav
    >  - src: "pass me the salt bottle"
    >  - res: "   "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.894737, loss: 61.059723
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/8178df6f-a65d-4745-862a-c6eb5b655d6d.wav
    >  - src: "it's time for lunch"
    >  - res: "  "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.888889, loss: 57.637714
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/7f7a8b09-01d8-451a-a425-4ca4da3322dc.wav
    >  - src: "she plays football"
    >  - res: "  "
    > --------------------------------------------------------------------------------

My CSV file has 34 audio recordings, about 5 seconds long on average, all from a single speaker (a female voice with an Indian accent). The WER is 1.000 and the loss is very high, and I cannot figure out where I am going wrong.
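For completeness, here is roughly how the audio can be checked against the 16 kHz / mono / 16-bit PCM input that DeepSpeech 0.9.3 expects (a rough sketch; it assumes the standard wav_filename column of the DeepSpeech CSV format):

    import csv
    import os
    import wave

    with open("/content/train.csv") as f:
        for row in csv.DictReader(f):
            path = row["wav_filename"]  # standard DeepSpeech CSV column
            if not os.path.isfile(path):
                print("missing file:", path)
                continue
            with wave.open(path, "rb") as w:
                rate = w.getframerate()
                channels = w.getnchannels()
                width = w.getsampwidth()  # bytes per sample; 2 == 16-bit PCM
                seconds = w.getnframes() / rate
            if rate != 16000 or channels != 1 or width != 2:
                print(f"{path}: {rate} Hz, {channels} ch, {8 * width}-bit, {seconds:.1f} s")

If anything prints here, that alone could plausibly explain a WER of 1.0.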
I have tried:

  • Using different dev and test sets, but with no major difference in results.

  • Fine-tuning with and without the scorer, but the WER remained the same.

  • Different combinations of hyperparameters, as suggested in the official docs, but the results are almost the same.

The questions I have:

  • Is it because I am using such a small amount of data?

  • I have a list of 100-200 commands that need to be recognized exactly. How do I fine-tune for that?

  • What hyperparameters (epochs, steps, etc.) are most suitable for this amount of data?

  • I recorded the audio with the WebDictaPhone app and converted the files from .oga to .wav with ffmpeg, invoked from Python via subprocess.

  • The sample rate was 48000 Hz, which I changed to 16000 Hz using pydub's AudioSegment (a rough sketch of the full conversion is below).
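A rough sketch of that conversion pipeline (the file paths are placeholders; the real names come from the recordings):

    import subprocess
    from pydub import AudioSegment

    # .oga -> .wav with ffmpeg, called from Python via subprocess
    subprocess.run(
        ["ffmpeg", "-y", "-i", "/content/recordings/clip01.oga",
         "/content/recordings/clip01.wav"],
        check=True,
    )

    # Downsample 48 kHz -> 16 kHz with pydub (also forcing mono here as an extra safeguard)
    audio = AudioSegment.from_wav("/content/recordings/clip01.wav")
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export("/content/recordings/clip01_16k.wav", format="wav")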