Fine Tuning with Custom English Data (Very Small Size)

  • Mozilla STT version: DeepSpeech 0.9.3
  • OS: Google Colab
  • Python 3.7.10
  • TensorFlow 1.15.2
  • GPU with CUDA 10.0

I tried fine-tuning my model by downloading the v0.9.3 checkpoints and scorer and using the parameters in the command further below. I originally split the data in an 8:1:1 ratio (roughly as in the sketch below), but later reused the same file for both the dev and test sets, so effectively an 8:2 split.
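For reference, the split was done roughly like this (a minimal sketch with pandas; the source file name and the random seed are placeholders, not my exact script):

    import pandas as pd

    # Shuffle the full manifest, then cut it 80/10/10 into train/dev/test.
    df = pd.read_csv("/content/all_data.csv")
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    n_train = int(0.8 * len(df))
    n_dev = int(0.1 * len(df))

    df.iloc[:n_train].to_csv("/content/train.csv", index=False)
    df.iloc[n_train:n_train + n_dev].to_csv("/content/intermediate.csv", index=False)  # dev set (later also reused as test)
    df.iloc[n_train + n_dev:].to_csv("/content/test.csv", index=False)

The fine-tuning command: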

 !python3 DeepSpeech.py --train_cudnn True --early_stop True --es_epochs 3 --es_steps 5 --n_hidden 2048 --epochs 20 \
  --export_dir /content/models/  --checkpoint_dir /content/model_checkpoints/ \
  --train_files /content/train.csv --dev_files /content/intermediate.csv --test_files /content/intermediate.csv \
  --learning_rate 0.0001 --train_batch_size 24 --test_batch_size 48 --dev_batch_size 48 --export_file_name 'ft_model' \
  --augment reverb[p=0.2,delay=50.0~30.0,decay=10.0:2.0~1.0] \
  --augment volume[p=0.2,dbfs=-10:-40] \
  --augment pitch[p=0.2,pitch=1~0.2] \
  --augment tempo[p=0.2,factor=1~0.5] 
This gave me the following results:
    > I0405 02:04:46.870591 140206192527232 utils.py:157] NumExpr defaulting to 2 threads.
    > I Could not find best validating checkpoint.
    > I Could not find most recent checkpoint.
    > I Initializing all variables.
    > I STARTING Optimization
    > Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 645.115417      
    > Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 195.178772 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 195.178772 to: /content/model_checkpoints/best_dev-1
    > --------------------------------------------------------------------------------
    > Epoch 1 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 208.354904      
    > Epoch 1 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 124.489716 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 124.489716 to: /content/model_checkpoints/best_dev-2
    > --------------------------------------------------------------------------------
    > Epoch 2 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 151.153809      
    > Epoch 2 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 149.881775 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 3 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 184.461914      
    > Epoch 3 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 138.397873 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 4 |   Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 169.295624      
    > Epoch 4 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 113.872894 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 113.872894 to: /content/model_checkpoints/best_dev-5
    > --------------------------------------------------------------------------------
    > Epoch 5 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 141.762100      
    > Epoch 5 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 111.355629 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 111.355629 to: /content/model_checkpoints/best_dev-6
    > --------------------------------------------------------------------------------
    > Epoch 6 |   Training | Elapsed Time: 0:00:03 | Steps: 1 | Loss: 124.472694      
    > Epoch 6 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 134.303055 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 7 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 144.363953      
    > Epoch 7 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 124.479218 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 8 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 131.589767      
    > Epoch 8 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 109.516235 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 109.516235 to: /content/model_checkpoints/best_dev-9
    > --------------------------------------------------------------------------------
    > Epoch 9 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 122.574684      
    > Epoch 9 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 104.244781 | Dataset: /content/intermediate.csv
    > I Saved new best validating model with loss 104.244781 to: /content/model_checkpoints/best_dev-10
    > --------------------------------------------------------------------------------
    > Epoch 10 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 120.229431     
    > Epoch 10 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 105.151703 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 11 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 124.279762     
    > Epoch 11 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 106.004166 | Dataset: /content/intermediate.csv
    > --------------------------------------------------------------------------------
    > Epoch 12 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 123.945976     
    > Epoch 12 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 105.062477 | Dataset: /content/intermediate.csv
    > I Early stop triggered as the loss did not improve the last 3 epochs
    > I FINISHED optimization in 0:05:06.340667
    > I Loading best validating checkpoint from /content/model_checkpoints/best_dev-10
    > I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
    > I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
    > I Loading variable from checkpoint: global_step
    > I Loading variable from checkpoint: layer_1/bias
    > I Loading variable from checkpoint: layer_1/weights
    > I Loading variable from checkpoint: layer_2/bias
    > I Loading variable from checkpoint: layer_2/weights
    > I Loading variable from checkpoint: layer_3/bias
    > I Loading variable from checkpoint: layer_3/weights
    > I Loading variable from checkpoint: layer_5/bias
    > I Loading variable from checkpoint: layer_5/weights
    > I Loading variable from checkpoint: layer_6/bias
    > I Loading variable from checkpoint: layer_6/weights
    > Testing model on /content/intermediate.csv
    > Test epoch | Steps: 1 | Elapsed Time: 0:00:12                                   
    > Test on /content/intermediate.csv - WER: 1.000000, CER: 0.880259, loss: 104.244766
    > --------------------------------------------------------------------------------
    > Best WER: 
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.884615, loss: 154.878494
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/07.wav
    >  - src: "brain is the highest coordinating centre in the body"
    >  - res: "      "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.860000, loss: 153.049408
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/09.wav
    >  - src: "energy required by an organism comes from the food"
    >  - res: "       "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.869565, loss: 138.342255
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/25.wav
    >  - src: "enables the creation of cross platform program"
    >  - res: "e e   "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.875000, loss: 120.579895
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/04.wav
    >  - src: "upgrade changes in core system resources"
    >  - res: "      "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.916667, loss: 107.980042
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/06.wav
    >  - src: "dealloaction is completely automatic"
    >  - res: "     "
    > --------------------------------------------------------------------------------
    > Median WER: 
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.869565, loss: 138.342255
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/25.wav
    >  - src: "enables the creation of cross platform program"
    >  - res: "e e   "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.875000, loss: 120.579895
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/04.wav
    >  - src: "upgrade changes in core system resources"
    >  - res: "      "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.916667, loss: 107.980042
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/06.wav
    >  - src: "dealloaction is completely automatic"
    >  - res: "     "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.880000, loss: 76.253258
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/08.wav
    >  - src: "liver secretes bile juice"
    >  - res: "    "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.869565, loss: 68.422134
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/f521a5fd-3081-4c34-9c13-ed1e840925ea.wav
    >  - src: "pass me the salt bottle"
    >  - res: "   "
    > --------------------------------------------------------------------------------
    > Worst WER: 
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.916667, loss: 107.980042
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/06.wav
    >  - src: "dealloaction is completely automatic"
    >  - res: "     "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.880000, loss: 76.253258
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/08.wav
    >  - src: "liver secretes bile juice"
    >  - res: "    "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.869565, loss: 68.422134
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/f521a5fd-3081-4c34-9c13-ed1e840925ea.wav
    >  - src: "pass me the salt bottle"
    >  - res: "   "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.894737, loss: 61.059723
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/8178df6f-a65d-4745-862a-c6eb5b655d6d.wav
    >  - src: "it's time for lunch"
    >  - res: "  "
    > --------------------------------------------------------------------------------
    > WER: 1.000000, CER: 0.888889, loss: 57.637714
    >  - wav: file:///content/drive/MyDrive/audiowavfiles/7f7a8b09-01d8-451a-a425-4ca4da3322dc.wav
    >  - src: "she plays football"
    >  - res: "  "
    > --------------------------------------------------------------------------------

My CSV file has 34 audio recordings, about 5 seconds long on average, all from a single speaker (a female voice with an Indian accent). The WER is 1.000 and the loss is very high, and I cannot figure out where I am going wrong.
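For completeness, here is roughly how the audio can be checked against the 16 kHz / mono / 16-bit PCM input that DeepSpeech 0.9.3 expects (a rough sketch; it assumes the standard wav_filename column of the DeepSpeech CSV format):

    import csv
    import os
    import wave

    with open("/content/train.csv") as f:
        for row in csv.DictReader(f):
            path = row["wav_filename"]  # standard DeepSpeech CSV column
            if not os.path.isfile(path):
                print("missing file:", path)
                continue
            with wave.open(path, "rb") as w:
                rate = w.getframerate()
                channels = w.getnchannels()
                width = w.getsampwidth()  # bytes per sample; 2 == 16-bit PCM
                seconds = w.getnframes() / rate
            if rate != 16000 or channels != 1 or width != 2:
                print(f"{path}: {rate} Hz, {channels} ch, {8 * width}-bit, {seconds:.1f} s")

If anything prints here, that alone could plausibly explain a WER of 1.0.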
I have tried:

  • Using different dev and test sets, but with no major difference in results.

  • Fine-tuning with and without the scorer, but the WER remained the same.

  • Different combinations of hyperparameters, as suggested in the official docs, but the results are almost the same.

The questions I have:

  • Is it because I am using such a small amount of data?

  • I have a list of 100-200 commands that need to be recognized exactly. How do I fine-tune for that?

  • What hyperparameters (epochs, steps, etc.) are most suitable for this amount of data?

  • I recorded the audio with the WebDictaPhone app and converted the files from .oga to .wav with ffmpeg, invoked from Python via subprocess.

  • The sample rate was 48000 Hz, which I changed to 16000 Hz using pydub's AudioSegment (a rough sketch of the full conversion is below).
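A rough sketch of that conversion pipeline (the file paths are placeholders; the real names come from the recordings):

    import subprocess
    from pydub import AudioSegment

    # .oga -> .wav with ffmpeg, called from Python via subprocess
    subprocess.run(
        ["ffmpeg", "-y", "-i", "/content/recordings/clip01.oga",
         "/content/recordings/clip01.wav"],
        check=True,
    )

    # Downsample 48 kHz -> 16 kHz with pydub (also forcing mono here as an extra safeguard)
    audio = AudioSegment.from_wav("/content/recordings/clip01.wav")
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export("/content/recordings/clip01_16k.wav", format="wav")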