Custom Model returns wrong inferences using transfer learning

Training a custom model for the Brazilian Portuguese language
Mozilla STT branch/version: master / DeepSpeech 0.9.3
OS Platform and Distribution: Linux Ubuntu 20.04
Python version: 3.6.9
TensorFlow version: tensorflow-estimator 1.15.1, tensorflow-gpu 1.15.4
CUDA version: CUDA V11.0
GPU model and memory: GeForce RTX 2060 with 6144 MB of memory

Hi,

I’m trying to train a new model using DeepSpeech v0.9.3 for the Brazilian Portuguese language. I’m using the Common Voice dataset in Portuguese, with 63 hours of voice data. I followed the steps available at https://mozilla.github.io/deepspeech-playbook/, so I loaded the data and generated the .csv files. I’m using transfer learning because I changed alphabet.txt (a peculiarity of Brazilian Portuguese is that it is rich in punctuation). Besides that, I added the location of the pre-trained model using the latest release checkpoints.
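
For reference, the CSV-generation step from the playbook looked roughly like this (a sketch, not my exact command; the paths follow the layout used below):

# Sketch: import the Common Voice Portuguese release (paths illustrative).
# bin/import_cv2.py transcodes the mp3 clips to 16 kHz wav files and
# writes train.csv, dev.csv and test.csv into the clips directory.
python3 bin/import_cv2.py \
  --filter_alphabet data/alphabet.txt \
  deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt
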
After those configurations, I ran the following command:

python3 DeepSpeech.py \
  --drop_source_layers 1 \
  --train_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/train.csv \
  --dev_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/dev.csv \
  --test_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --load_checkpoint_dir deepspeech-data-pt-br/deepspeech-0.9.3-checkpoint/ \
  --save_checkpoint_dir deepspeech-data-pt-br/checkpoints-pt-v1/ \
  --epochs 11 \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --learning_rate 0.000005 \
  --dropout_rate 0.3

After the training completed, the test phase reported the following results:

Testing model on deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv
Test epoch | Steps: 4626 | Elapsed Time: 2:55:13
Test on deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv - WER: 0.990786, CER: 0.858851, loss: 125.119606

Best WER:

WER: 0.666667, CER: 0.875000, loss: 50.859680

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19343902.wav
  • src: “a reunião acabou”
  • res: " a"

WER: 0.666667, CER: 0.636364, loss: 41.534481

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19289366.wav
  • src: “e eu também”
  • res: “e e e”

WER: 0.666667, CER: 0.846154, loss: 34.967342

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19482367.wav
  • src: “paulo e pedro”
  • res: " e"

WER: 0.750000, CER: 0.813953, loss: 117.346085

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19839053.wav
  • src: “a economia está tremendo e afetando a china”
  • res: “a e aa”

WER: 0.750000, CER: 0.833333, loss: 69.561157

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_23010628.wav
  • src: “aconselhá los a cooperar”
  • res: " a"

Median WER:

WER: 1.000000, CER: 0.842105, loss: 122.082756

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19887850.wav
  • src: “minha viagem à bélgica foi um desastre”
  • res: “s e s”

WER: 1.000000, CER: 0.897436, loss: 121.977188

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19288338.wav
  • src: “primeiro você tem que assinar um recibo”
  • res: " "

WER: 1.000000, CER: 0.918919, loss: 121.896103

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19338093.wav
  • src: “tomando o ferry foi uma escolha sábia”
  • res: " a"

WER: 1.000000, CER: 0.850000, loss: 121.831039

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_21725171.wav
  • src: “ele fez um corte na lateral da embalagem”
  • res: " aaaa"

WER: 1.000000, CER: 0.804878, loss: 121.822968

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_20655700.wav
  • src: “o bombeiro não conseguiu resgatar o rapaz”
  • res: " aaa"

Worst WER:

WER: 1.000000, CER: 1.000000, loss: 11.466764

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22472394.wav
  • src: “sete”
  • res: “”

WER: 1.000000, CER: 1.000000, loss: 10.063250

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22019272.wav
  • src: “sim”
  • res: “”

WER: 1.000000, CER: 0.750000, loss: 8.606501

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_23915825.wav
  • src: “seis”
  • res: “s”

WER: 2.000000, CER: 2.000000, loss: 30.992125

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22206796.wav
  • src: “não”
  • res: " e e "

WER: 2.000000, CER: 1.333333, loss: 29.169136

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19276313.wav
  • src: “olá”
  • res: " o aa"

As can be seen, ‘res’ returns a wrong inference, or in other cases a blank one. I can’t see where I am going wrong. Could it be the number of epochs (should it be higher)? Did I forget a hyperparameter, or define one incorrectly?

Could anyone help me with this problem?

You probably want to be using transfer learning. I trained a model for Portuguese. You can find the necessary files here.

The results I got were a CER of 20% with a language model.
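
If you want to build the language-model side yourself, the standard DeepSpeech scorer pipeline looks roughly like this (a sketch: pt-corpus.txt.gz is a placeholder for a Portuguese text corpus, and the alpha/beta values are the documented defaults, not values tuned for Portuguese):

# Step 1 (sketch): build a KenLM language model from a text corpus.
python3 data/lm/generate_lm.py \
  --input_txt pt-corpus.txt.gz \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
# Step 2 (sketch): package lm.binary and the vocabulary into a .scorer,
# using the same alphabet as the acoustic model.
./generate_scorer_package \
  --alphabet data/alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package pt.scorer \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284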

I’m happy to go through the training scheme; please feel free to join Mozilla’s Matrix.

For info, the hyperparameters I used were:

  • 100 epochs
  • Learning rate 0.001
  • Dropout 0.2
  • SpecAugment
    • frequency_mask[p=0.8,n=2:4,size=2:4]
    • time_mask[p=0.8,n=2:4,size=10:50,domain=spectrogram]

You’ll also want to increase your batch sizes: make them as big as will fit on the GPU (in my case 8). Note that you probably don’t want punctuation in the alphabet of the acoustic model; if you need punctuation, it’s better to add it in a post-processing step afterwards. A combined training command along these lines is sketched below.
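
Put together, a training run with those settings would look roughly like this (a sketch with placeholder paths, not my exact command; note that the released 0.9.3 checkpoints use --n_hidden 2048, so that value has to match when loading them):

# Sketch only: placeholder paths; hyperparameters as listed above.
# The augment values are quoted to avoid shell glob expansion.
python3 DeepSpeech.py \
  --drop_source_layers 1 \
  --load_checkpoint_dir path/to/deepspeech-0.9.3-checkpoint/ \
  --save_checkpoint_dir path/to/new-checkpoints/ \
  --alphabet_config_path path/to/alphabet.txt \
  --train_files path/to/clips/train.csv \
  --dev_files path/to/clips/dev.csv \
  --test_files path/to/clips/test.csv \
  --n_hidden 2048 \
  --epochs 100 \
  --learning_rate 0.001 \
  --dropout_rate 0.2 \
  --train_batch_size 8 \
  --dev_batch_size 8 \
  --test_batch_size 8 \
  --augment "frequency_mask[p=0.8,n=2:4,size=2:4]" \
  --augment "time_mask[p=0.8,n=2:4,size=10:50,domain=spectrogram]"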

Thanks for your response. I cannot access your training files. I have been working on this model for a few months. I managed to reach a hyperparameterization that starts to recognize some of the words spoken in the audio, but the result is still not satisfactory: my current WER is 84%, and I can’t seem to get it down to a low value (like your 20%, for example). My dataset is a mix of clean audio (provided by me) and clean/noisy audio (from Mozilla Common Voice). But in my experiments, when I use my model on audio recorded via microphone in a noisy environment, it doesn’t work very well. My questions for you: how did you handle speech recognition in a noisy environment? How large is your dataset, in hours of audio? Did you build your own scorer and use it in the training phase?


If possible, could you provide your training files again so that I can analyze them and compare them with mine?

Here are the checkpoints: https://itml.cl.indiana.edu/models/pt/.
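
If it helps, you can sanity-check them by running a test epoch directly against your own test set, roughly like this (a sketch; the paths are placeholders and the alphabet must be the one the checkpoints were trained with):

# Sketch: evaluate downloaded checkpoints on a test set (paths illustrative).
python3 evaluate.py \
  --test_files path/to/clips/test.csv \
  --alphabet_config_path path/to/alphabet.txt \
  --checkpoint_dir path/to/pt-checkpoints/ \
  --test_batch_size 8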