Spanish: Blank inferences

I trained my model on the Spanish Common Voice dataset for 10 epochs. Both the test results during training and the inference obtained when running:

deepspeech --model deepspeech-0.9.1-models.pbmm --scorer deepspeech-0.9.1-models.scorer --audio my_audio_file.wav

come back blank:

Example:

WER: 1.000000, CER: 0.864865, loss: 107.391472
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/archivo-1565384151948609.wav
 - src: "qué peleas se agarraban entre ustedes"
 - res: "        "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.852941, loss: 107.340851
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/common_voice_es_19602468.wav
 - src: "sentí que cada riff estaba escrito"
 - res: "          "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.852941, loss: 107.299416
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/archivo-1562620039745670.wav
 - src: "oyó a un grupo releyendo geografía"
 - res: "       "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.810811, loss: 107.287590
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/common_voice_es_19139609.wav
 - src: "en roma estuvo en el colegio de lieja"
 - res: "          "
--------------------------------------------------------------------------------
Worst WER: 
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.333333, loss: 21.902287
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/archivo-1557876943950223.wav
 - src: "non"
 - res: "    "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 20.615292
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/archivo-156452630840292.wav
 - src: "rossi"
 - res: "   "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 20.549049
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/archivo-1556823942669887.wav
 - src: "sisisi"
 - res: "  "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 17.611378
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/archivo-1565617749088932.wav
 - src: "enid"
 - res: "  "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.800000, loss: 17.374151
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/alpha_deepspeech/clips/archivo-1556197352940137.wav
 - src: "no no"
 - res: "  "
--------------------------------------------------------------------------------

Training command:

CUDA_VISIBLE_DEVICES=0 python3 DeepSpeech.py --train_files ~/training_audios/audios_entrenamiento/temp/deepspeech/clips/train.csv --dev_files ~/training_audios/audios_entrenamiento/temp/deepspeech/clips/dev.csv --test_files ~/training_audios/audios_entrenamiento/temp/deepspeech/clips/test.csv --automatic_mixed_precision --alphabet_config_path ~/train_deepspeech/alphabet.txt --checkpoint_dir ~/train_deepspeech/deepspeech/checkpoints --export_dir ~/train_deepspeech/deepspeech/checkpoints/export --log_level 0 --epochs 10 --limit_test 5000

Number of entries in each dataset file:

train.csv: 256522
dev.csv: 28611
test.csv: 21574

Alphabet.txt:

a
á
à
â
ä
b
c
d
e
é
è
ê
ë
f
g
h
i
í
ì
î
ï
j
k
l
m
n
ñ
o
ó
ò
ô
ö
p
q
r
s
t
u
ú
ù
û
ü
v
w
x
y
z
!
¡
?
¿
´
¨

“blank space”

Environment:

  • deepspeech: 0.9.2
  • deepspeech-training: 0.9.2
  • OS Platform and Distribution: Ubuntu 18.04
  • TensorFlow installed from: 1.15.4
  • TensorFlow version: 1.15.4
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA Version 10.0.130 / CUDNN_MAJOR 7
  • GPU model and memory: Tesla v100 16 GB

Thanks, Manuel, for a well-written post. Blanks can mean that it is not trained enough, but 10 epochs for that dataset size should produce something.

  1. What about dropout and learning rate? The standard dropout is not suitable; maybe try 0.3 (see the example flags after this list).

  2. Did you build your own scorer or is this the English one?

  3. Don’t limit the test set; reduce its size if you want to.

  4. Reduce the alphabet to just the Spanish letters, maybe even just the English ones. The more letters, the more training material you need.

  5. Use a batch size of 32 or 64 for all three (train/dev/test). A V100 should be able to handle that.

  6. “Blank space” is just a blank in the file, but it doesn’t show here?
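As a rough sketch of points 1 and 5, the relevant DeepSpeech.py flags would look something like the following. Paths are placeholders, and the learning rate value is only a guess to experiment with, not a known-good setting:

# example flags for dropout, learning rate and batch sizes (placeholder paths)
python3 DeepSpeech.py \
  --train_files /path/to/train.csv --dev_files /path/to/dev.csv --test_files /path/to/test.csv \
  --alphabet_config_path /path/to/alphabet.txt \
  --dropout_rate 0.3 --learning_rate 0.0001 \
  --train_batch_size 64 --dev_batch_size 64 --test_batch_size 64 \
  --epochs 10 \
  --checkpoint_dir /path/to/checkpoints --export_dir /path/to/export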


You could also try using the Spanish checkpoint + scorer from the DeepSpeech-Polyglot project as a basis and run transfer learning on top of it if you want to keep your alphabet.
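For reference, a transfer-learning run on top of such a checkpoint would use the 0.9.x flags roughly as below. The checkpoint path is a placeholder; --drop_source_layers 1 re-initialises the output layer so your own alphabet can be used:

# fine-tune from an existing checkpoint while keeping your own alphabet
python3 DeepSpeech.py \
  --train_files /path/to/train.csv --dev_files /path/to/dev.csv --test_files /path/to/test.csv \
  --alphabet_config_path ~/train_deepspeech/alphabet.txt \
  --load_checkpoint_dir /path/to/polyglot_spanish_checkpoint \
  --save_checkpoint_dir ~/train_deepspeech/transfer/checkpoints \
  --drop_source_layers 1 \
  --epochs 10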

@dan.bmh how silly of me not to mention your models :slight_smile: Great idea. You’ll find an alphabet and a working scorer if you want to train on just your material.

@othiele I don’t have any scorer files. Is it really necessary?

I ran the training again with the recommendations you gave me:

CUDA_VISIBLE_DEVICES=0 python3 DeepSpeech.py --train_files ~/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/train.csv --dev_files ~/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/dev.csv --test_files ~/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/test.csv --automatic_mixed_precision --alphabet_config_path ~/train_deepspeech/alphabet.txt --checkpoint_dir ~/train_deepspeech/train_04_01_2021/checkpoints --export_dir ~/train_deepspeech/train_04_01_2021/checkpoints/export --log_level 0 --epochs 10 --dropout_rate 0.3 --train_batch_size 64 --dev_batch_size 64 --test_batch_size 64 --export_batch_size 64

and I got the following results:

--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.857143, loss: 30.235117
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21987879.wav
 - src: "firefox"
 - res: "o"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.833333, loss: 23.909798
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21942424.wav
 - src: "cuatro"
 - res: "oo"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 16.605837
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21961351.wav
 - src: "nueve"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 14.875109
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_22036789.wav
 - src: "siete"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 14.321535
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21983279.wav
 - src: "cinco"
 - res: ""
--------------------------------------------------------------------------------
Median WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 14.321535
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21983279.wav
 - src: "cinco"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 13.271736
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_22043319.wav
 - src: "tres"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 12.867823
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21886210.wav
 - src: "hey"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 12.829750
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21944989.wav
 - src: "cero"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 11.550548
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21962058.wav
 - src: "seis"
 - res: ""
--------------------------------------------------------------------------------
Worst WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 11.550548
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21962058.wav
 - src: "seis"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 10.331042
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21944355.wav
 - src: "sí"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 9.695055
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21989345.wav
 - src: "dos"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 9.292615
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21939380.wav
 - src: "uno"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 5.831096
 - wav: file:///home/manuel_servex/training_audios/audios_entrenamiento/temp/servex_common_voice/clips/common_voice_es_21939330.wav
 - src: "no"
 - res: ""
--------------------------------------------------------------------------------

And in these last training runs, is it always the same test files?

How do I build my own scorer?

  1. I am not sure what happens if you don’t use a scorer for testing; it kind of defeats the purpose … Try inference without a scorer on a known/trained chunk (see the sketch after this list).

  2. Why did you choose an export batch size of 64? I have never seen that. I advised it for the three train/dev/test batch sizes; you used it for four. Please try to understand what each parameter does, read some other posts, … This is not an end-user product yet :slight_smile:

  3. What are the loss values for train/dev at the end of each epoch? This should give an indication of how the training went.

  4. How many hours of material do you have, and what is the mean clip length? These look like really short commands.
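For point 1, running inference without a scorer just means omitting the --scorer flag, e.g. on a clip the model has already seen in training (the model path is a placeholder for your exported graph):

# plain acoustic-model inference, no language model / scorer
deepspeech --model /path/to/exported/output_graph.pbmm --audio some_known_training_clip.wav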

@othiele Thanks for your answer

  1. I tested the English model with the scorer from the English repository and everything worked correctly.

  2. I chose it for simple tests because of the results I was getting, but the results with or without an export batch size of 64 are the same.

  3. Loss:
     Train: 13.899992
     Dev: 15.962547

  4. But less than 350 hours.

Please read the docs carefully and try to understand what the scorer does. It looks like you did not try plain inference with the model, without a scorer, as I suggested?
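For completeness, the 0.9.x external-scorer workflow is roughly: build a KenLM language model from a Spanish text corpus, then package it together with your alphabet. A sketch, where the corpus file and paths are placeholders and the alpha/beta values are just the English defaults from the docs (they should really be tuned for Spanish, e.g. with lm_optimizer.py):

# 1) build the language model from a text corpus
python3 data/lm/generate_lm.py --input_txt spanish_corpus.txt --output_dir . \
  --top_k 500000 --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
# 2) package lm.binary + vocab + alphabet into a .scorer
./generate_scorer_package --alphabet ~/train_deepspeech/alphabet.txt --lm lm.binary \
  --vocab vocab-500000.txt --package kenlm_es.scorer \
  --default_alpha 0.931289039105002 --default_beta 1.1834137581510284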

Losses only make sense over time, single data points are not really helpful.

What is your use case?

@dan.bmh
I have downloaded the Spanish model you trained from your repository. How can I test it?

There are multiple ways to test it. You can either follow the setup steps in DS-Polyglot or use the DS testing script directly, which might be faster if you have already set up DS for training. You can also use the provided .pbmm and .scorer files for normal inference, like you did in your first post.
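For the last option, that is the same CLI call as in the first post, just pointed at the Polyglot files (the file names below are placeholders for whatever the release ships):

deepspeech --model /path/to/polyglot_es/output_graph.pbmm --scorer /path/to/polyglot_es/kenlm_es.scorer --audio my_audio_file.wav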