Training a custom model for Brazilian Portuguese
Mozilla STT branch/version: master/DeepSpeech 0.9.3
OS Platform and Distribution: Linux Ubuntu 20.04
Python version: 3.6.9
TensorFlow version: tensorflow-estimator 1.15.1, tensorflow-gpu 1.15.4
CUDA version: CUDA V11.0
GPU model and memory: GeForce RTX 2060 with 6144 MB of memory
Hi,
I’m trying to train a new model for Brazilian Portuguese using DeepSpeech v0.9.3. I’m using the Portuguese Common Voice dataset, which has 63 hours of speech. I followed the steps in https://mozilla.github.io/deepspeech-playbook/, so I imported the data and generated the .csv files. I’m using transfer learning because I changed alphabet.txt (a peculiarity of Brazilian Portuguese is that it is rich in accented characters). I also pointed the training at the pre-trained model, using the checkpoints from the latest release.
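To make the alphabet step concrete, here is a minimal sketch of how the character set can be collected from the training transcripts. It assumes the usual DeepSpeech CSV layout with a `transcript` column; the function name `build_alphabet` and the in-memory sample rows are only illustrative and are not my real data.

```python
import csv
import io


def build_alphabet(csv_files):
    """Collect every distinct character found in the `transcript` column."""
    chars = set()
    for f in csv_files:
        for row in csv.DictReader(f):
            chars.update(row["transcript"])
    return sorted(chars)


# Tiny in-memory stand-in for train.csv (hypothetical rows, same column layout).
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,olá\n"
    "b.wav,1000,não sei\n"
)

alphabet = build_alphabet([sample])

# alphabet.txt lists one character per line; the space character is a line
# containing a single space.
print("\n".join(alphabet))
```

With the real data, the same function would be given the opened train.csv, dev.csv and test.csv files, so that every character appearing in any transcript ends up in the alphabet.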
With that configuration in place, I ran the following command:
python3 DeepSpeech.py \
  --drop_source_layers 1 \
  --train_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/train.csv \
  --dev_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/dev.csv \
  --test_files deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --load_checkpoint_dir deepspeech-data-pt-br/deepspeech-0.9.3-checkpoint/ \
  --save_checkpoint_dir deepspeech-data-pt-br/checkpoints-pt-v1/ \
  --epochs 11 \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --learning_rate 0.000005 \
  --dropout_rate 0.3
When training finished, the test phase reported the following results:
Testing model on deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv
Test epoch | Steps: 4626 | Elapsed Time: 2:55:13
Test on deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/test.csv - WER: 0.990786, CER: 0.858851, loss: 125.119606
Best WER:
WER: 0.666667, CER: 0.875000, loss: 50.859680
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19343902.wav
- src: “a reunião acabou”
- res: " a"
WER: 0.666667, CER: 0.636364, loss: 41.534481
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19289366.wav
- src: “e eu também”
- res: “e e e”
WER: 0.666667, CER: 0.846154, loss: 34.967342
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19482367.wav
- src: “paulo e pedro”
- res: " e"
WER: 0.750000, CER: 0.813953, loss: 117.346085
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19839053.wav
- src: “a economia está tremendo e afetando a china”
- res: “a e aa”
WER: 0.750000, CER: 0.833333, loss: 69.561157
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_23010628.wav
- src: “aconselhá los a cooperar”
- res: " a"
Median WER:
WER: 1.000000, CER: 0.842105, loss: 122.082756
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19887850.wav
- src: “minha viagem à bélgica foi um desastre”
- res: “s e s”
WER: 1.000000, CER: 0.897436, loss: 121.977188
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19288338.wav
- src: “primeiro você tem que assinar um recibo”
- res: " "
WER: 1.000000, CER: 0.918919, loss: 121.896103
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19338093.wav
- src: “tomando o ferry foi uma escolha sábia”
- res: " a"
WER: 1.000000, CER: 0.850000, loss: 121.831039
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_21725171.wav
- src: “ele fez um corte na lateral da embalagem”
- res: " aaaa"
WER: 1.000000, CER: 0.804878, loss: 121.822968
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_20655700.wav
- src: “o bombeiro não conseguiu resgatar o rapaz”
- res: " aaa"
Worst WER:
WER: 1.000000, CER: 1.000000, loss: 11.466764
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22472394.wav
- src: “sete”
- res: “”
WER: 1.000000, CER: 1.000000, loss: 10.063250
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22019272.wav
- src: “sim”
- res: “”
WER: 1.000000, CER: 0.750000, loss: 8.606501
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_23915825.wav
- src: “seis”
- res: “s”
WER: 2.000000, CER: 2.000000, loss: 30.992125
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_22206796.wav
- src: “não”
- res: " e e "
WER: 2.000000, CER: 1.333333, loss: 29.169136
- wav: file://deepspeech-data-pt-br/cv-corpus-6.1-2020-12-11/pt/clips/common_voice_pt_19276313.wav
- src: “olá”
- res: " o aa"
As the ‘res’ lines show, the model returns either a wrong transcription or, in other cases, an empty one. I can’t see where I went wrong. Do I need more epochs? Did I forget a hyperparameter, or set one of them incorrectly?
Could anyone help me with this problem?
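A note on the metrics, in case it helps with reading the report: some entries show a WER of 2.0, which is possible because the error count is divided by the number of words in the reference, not in the output. Below is a minimal sketch of that calculation, assuming WER is the word-level Levenshtein distance divided by the reference word count (as far as I understand, this is how DeepSpeech reports it; the helper names are my own).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # drop a reference word
                            curr[j - 1] + 1,       # extra word in the output
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edit distance over reference length."""
    ref_words = src.split()
    hyp_words = res.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Two entries taken from the report above.
print(wer("não", " e e "))  # 2.0
print(wer("a economia está tremendo e afetando a china", "a e aa"))  # 0.75
```

So for “não” the output contains two wrong words against a one-word reference, which gives 2 / 1 = 2.0.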