Custom Model returns wrong inferences using transfer learning

Training a custom model for the Brazilian Portuguese language
Mozilla STT branch/version: master / DeepSpeech 0.9.3
OS Platform and Distribution: Linux Ubuntu 20.04
Python version: 3.6.9
TensorFlow version: tensorflow-estimator 1.15.1, tensorflow-gpu 1.15.4
CUDA version: CUDA V11.0
GPU model and memory: GeForce RTX 2060 with 6144 MB of memory

Hi,

I’m trying to train a new model using DeepSpeech v0.9.3 for the Brazilian Portuguese language. I’m using the Common Voice dataset in Portuguese, with 63 hours of voice data. I followed the steps available at https://mozilla.github.io/deepspeech-playbook/, so I loaded the data and generated the .csv files. I’m using transfer learning because I changed alphabet.txt (a peculiarity of Brazilian Portuguese is that it is rich in punctuation). Besides that, I added the location of the pre-trained model using the latest release checkpoints.
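
For reference, the CSV-generation step from the playbook looked roughly like this (a sketch, not my exact command; the paths follow the layout used below):

# Sketch: import the Common Voice Portuguese release (paths illustrative).
# bin/import_cv2.py transcodes the mp3 clips to 16 kHz wav files and
# writes train.csv, dev.csv and test.csv into the clips directory.
python3 bin/import_cv2.py \
  --filter_alphabet data/alphabet.txt \
  deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt
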
After those configurations, I ran the following command:

python3 DeepSpeech.py \
  --drop_source_layers 1 \
  --train_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/train.csv \
  --dev_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/dev.csv \
  --test_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --load_checkpoint_dir deepspeech-data-pt-br/deepspeech-0.9.3-checkpoint/ \
  --save_checkpoint_dir deepspeech-data-pt-br/checkpoints-pt-v1/ \
  --epochs 11 \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --learning_rate 0.000005 \
  --dropout_rate 0.3

After the training completed, the test phase reported the following results:

Testing model on deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv
Test epoch | Steps: 4626 | Elapsed Time: 2:55:13
Test on deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv - WER: 0.990786, CER: 0.858851, loss: 125.119606

Best WER:

WER: 0.666667, CER: 0.875000, loss: 50.859680

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19343902.wav
  • src: “a reunião acabou”
  • res: " a"

WER: 0.666667, CER: 0.636364, loss: 41.534481

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19289366.wav
  • src: “e eu também”
  • res: “e e e”

WER: 0.666667, CER: 0.846154, loss: 34.967342

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19482367.wav
  • src: “paulo e pedro”
  • res: " e"

WER: 0.750000, CER: 0.813953, loss: 117.346085

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19839053.wav
  • src: “a economia está tremendo e afetando a china”
  • res: “a e aa”

WER: 0.750000, CER: 0.833333, loss: 69.561157

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_23010628.wav
  • src: “aconselhá los a cooperar”
  • res: " a"

Median WER:

WER: 1.000000, CER: 0.842105, loss: 122.082756

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19887850.wav
  • src: “minha viagem à bélgica foi um desastre”
  • res: “s e s”

WER: 1.000000, CER: 0.897436, loss: 121.977188

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19288338.wav
  • src: “primeiro você tem que assinar um recibo”
  • res: " "

WER: 1.000000, CER: 0.918919, loss: 121.896103

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19338093.wav
  • src: “tomando o ferry foi uma escolha sábia”
  • res: " a"

WER: 1.000000, CER: 0.850000, loss: 121.831039

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_21725171.wav
  • src: “ele fez um corte na lateral da embalagem”
  • res: " aaaa"

WER: 1.000000, CER: 0.804878, loss: 121.822968

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_20655700.wav
  • src: “o bombeiro não conseguiu resgatar o rapaz”
  • res: " aaa"

Worst WER:

WER: 1.000000, CER: 1.000000, loss: 11.466764

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22472394.wav
  • src: “sete”
  • res: “”

WER: 1.000000, CER: 1.000000, loss: 10.063250

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22019272.wav
  • src: “sim”
  • res: “”

WER: 1.000000, CER: 0.750000, loss: 8.606501

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_23915825.wav
  • src: “seis”
  • res: “s”

WER: 2.000000, CER: 2.000000, loss: 30.992125

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22206796.wav
  • src: “não”
  • res: " e e "

WER: 2.000000, CER: 1.333333, loss: 29.169136

  • wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19276313.wav
  • src: “olá”
  • res: " o aa"

As can be seen, ‘res’ returns a wrong inference, or in other cases a blank one. I can’t see where I am going wrong. Could it be the number of epochs (should it be higher)? Did I forget a hyperparameter, or define one incorrectly?

Could anyone help me with this problem?

You probably want to be using transfer learning. I trained a model for Portuguese. You can find the necessary files here.

The results I got were a CER of 20% with a language model.
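
If you want to build the language-model side yourself, the standard DeepSpeech scorer pipeline looks roughly like this (a sketch: pt-corpus.txt.gz is a placeholder for a Portuguese text corpus, and the alpha/beta values are the documented defaults, not values tuned for Portuguese):

# Step 1 (sketch): build a KenLM language model from a text corpus.
python3 data/lm/generate_lm.py \
  --input_txt pt-corpus.txt.gz \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
# Step 2 (sketch): package lm.binary and the vocabulary into a .scorer,
# using the same alphabet as the acoustic model.
./generate_scorer_package \
  --alphabet data/alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package pt.scorer \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284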

I’m happy to go through the training scheme; please feel free to join Mozilla’s Matrix.

For info, the hyperparameters I used were:

  • 100 epochs
  • Learning rate 0.001
  • Dropout 0.2
  • SpecAugment
    • frequency_mask[p=0.8,n=2:4,size=2:4]
    • time_mask[p=0.8,n=2:4,size=10:50,domain=spectrogram]

You’ll also want to increase your batch sizes: make them as big as will fit on the GPU (in my case 8). Note that you probably don’t want punctuation in the alphabet of the acoustic model; if you need punctuation, it’s better to add it in a post-processing step afterwards. A combined training command along these lines is sketched below.
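
Put together, a training run with those settings would look roughly like this (a sketch with placeholder paths, not my exact command; note that the released 0.9.3 checkpoints use --n_hidden 2048, so that value has to match when loading them):

# Sketch only: placeholder paths; hyperparameters as listed above.
# The augment values are quoted to avoid shell glob expansion.
python3 DeepSpeech.py \
  --drop_source_layers 1 \
  --load_checkpoint_dir path/to/deepspeech-0.9.3-checkpoint/ \
  --save_checkpoint_dir path/to/new-checkpoints/ \
  --alphabet_config_path path/to/alphabet.txt \
  --train_files path/to/clips/train.csv \
  --dev_files path/to/clips/dev.csv \
  --test_files path/to/clips/test.csv \
  --n_hidden 2048 \
  --epochs 100 \
  --learning_rate 0.001 \
  --dropout_rate 0.2 \
  --train_batch_size 8 \
  --dev_batch_size 8 \
  --test_batch_size 8 \
  --augment "frequency_mask[p=0.8,n=2:4,size=2:4]" \
  --augment "time_mask[p=0.8,n=2:4,size=10:50,domain=spectrogram]"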

Thanks for your response. I cannot access your training files. I have been working on this model for a few months. I managed to reach a hyperparameterization that starts to recognize some of the words spoken in the audio, but the result is still not satisfactory: my current WER is 84%, and I can’t seem to get it down to a low value (like your 20%, for example). My dataset is a mix of clean audio (provided by me) and clean/noisy audio (from Mozilla Common Voice). But in my experiments, when I use my model on audio recorded via microphone in a noisy environment, it doesn’t work very well. My questions for you: how did you handle speech recognition in a noisy environment? How large is your dataset, in hours of audio? Did you build your own scorer and use it in the training phase?


If possible, could you provide your training files again so that I can analyze them and compare them with mine?

Here are the checkpoints: https://itml.cl.indiana.edu/models/pt/.
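
If it helps, you can sanity-check them by running a test epoch directly against your own test set, roughly like this (a sketch; the paths are placeholders and the alphabet must be the one the checkpoints were trained with):

# Sketch: evaluate downloaded checkpoints on a test set (paths illustrative).
python3 evaluate.py \
  --test_files path/to/clips/test.csv \
  --alphabet_config_path path/to/alphabet.txt \
  --checkpoint_dir path/to/pt-checkpoints/ \
  --test_batch_size 8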