alemol
(Alejandro Molina)
September 3, 2019, 8:30pm
1
This topic started from issue #2306, where I posted some initial experiments to train a DS model for Mexican Spanish.
So far my progress has been:
to gather some initial data from the ciempiess corpus
to generate tools to format CSV files for DS
cat ds_out/train.csv
wav_filename,wav_filesize,transcript
data/speech/male/M30ABR1342/CHMC_M_75_30ABR1342_0000.wav,130448,veamos e parece como lo de la invasión española
data/speech/female/F30ABR1528/CHMC_F_75_30ABR1528_0000.wav,42398,hay fuego
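For reference, a minimal sketch of a script that produces this CSV format (the helper below and its inputs are hypothetical, not the actual ciem2ds tooling):

```python
import csv
import os

def write_ds_csv(entries, out_path):
    """Write (wav_path, transcript) pairs in the DeepSpeech CSV format:
    wav_filename,wav_filesize,transcript."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in entries:
            # wav_filesize is the size of the audio file in bytes on disk
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])
```

DeepSpeech uses the wav_filesize column to sort samples by length for batching, so it should match the file actually on disk.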
My first run
python -u DeepSpeech.py \
--train_files /home/amolina/repo/ciem2ds/ciempiess_ds/sortlen_half_train.csv \
--test_files /home/amolina/repo/ciem2ds/ciempiess_ds/sortlen_all_test.csv \
--alphabet_config_path data/mex_alphabet.txt \
--train_batch_size 1 \
--test_batch_size 1 \
--n_hidden 100 \
--epochs 200 \
--checkpoint_dir "$checkpoint_dir" \
"$@"
carlfm01
September 3, 2019, 8:56pm
2
Hello @alemol, I see that you are using ciempiess data. From my experience with ciempiess, the data is not clean enough; wrong transcriptions lead to inf losses, as your log shows.
See the ground-truth transcriptions “tiones” and “ciertas ac”, which look wrong.
Your train_batch_size is too low; try increasing it to 20. Are you training on GPU?
Did you train a new LM? I see you didn’t use the lm param; maybe it is falling back to the English one?
Try using data from http://www.openslr.org/resources.php, the crowdsourced set works for me.
alemol
(Alejandro Molina)
September 5, 2019, 5:20pm
3
I added my own language model:
python -u DeepSpeech.py \
--train_files /home/amolina/repo/ciem2ds/ciempiess_ds/sortlen_half_train.csv \
--test_files /home/amolina/repo/ciem2ds/ciempiess_ds/sortlen_all_test.csv \
--alphabet_config_path data/mex_alphabet.txt \
--lm_binary_path data/mexlm/transcrip_efinfo_noloc_2017-2018_probing.binary \
--train_batch_size 2 \
--test_batch_size 1 \
--n_hidden 124 \
--epochs 30 \
--checkpoint_dir "$checkpoint_dir" \
"$@"
Got better:
Epoch 29 | Training | Elapsed Time: 0:07:03 | Steps: 9297 | Loss: 66.886114
I FINISHED optimization in 3:32:14.989515
WARNING:tensorflow:From /home/amolina/deepvenv/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I Restored variables from most recent checkpoint at data/CIEMhalf_checkpoint/train-278910, step 278910
Testing model on /home/amolina/repo/ciem2ds/ciempiess_ds/sortlen_all_test.csv
Computing acoustic model predictions | Steps: 6974 | Elapsed Time: 0:01:56
Decoding predictions | 100% (6974 of 6974) |###################################################################################################################| Elapsed Time: 0:23:50 Time: 0:23:50
Test on /home/amolina/repo/ciem2ds/ciempiess_ds/sortlen_all_test.csv - WER: 0.911197, CER: 0.754889, loss: 154.287247
--------------------------------------------------------------------------------
WER: 2.500000, CER: 12.000000, loss: 51.920681
- src: "después evolucionó"
- res: "de que con con lo"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 4.000000, loss: 7.394387
- src: "pintando"
- res: "en tanto"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 5.000000, loss: 9.625606
- src: "talando"
- res: "a la"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 4.000000, loss: 10.330667
- src: "esclavos"
- res: "es la"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 4.000000, loss: 11.659359
- src: "entonces"
- res: "en donde"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 5.000000, loss: 12.737014
- src: "soldados"
- res: "son las"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 5.000000, loss: 15.502726
- src: "tiones"
- res: "yo me"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 4.000000, loss: 15.514644
- src: "okey"
- res: "o que"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 6.000000, loss: 17.840826
- src: "círculo"
- res: "si no"
--------------------------------------------------------------------------------
WER: 1.666667, CER: 22.000000, loss: 56.092628
- src: "concientización magníficamente desentendimiento"
- res: "con sentido la mexicana sentimiento"
--------------------------------------------------------------------------------
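As a side note on reading these numbers: the per-sample WER reported here is the word-level edit distance divided by the number of words in the reference, which is why short references can score well above 1.0. A minimal sketch of that computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: word edit distance over reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

# wer("después evolucionó", "de que con con lo") -> 2.5, as in the report above
```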
carlfm01
September 5, 2019
4
--lm_trie_path is missing. Did you train the LM with the accents á, é, ó? I think it is only using a-z without accents from the English trie.
alemol:
--train_batch_size 2
Are you just learning how to use DS or trying to build a usable model?
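One quick way to catch an alphabet/accent mismatch like this is to check that every character appearing in the transcripts is covered by the alphabet. A minimal sketch (file reading is left out; pass the alphabet characters and transcript lines directly):

```python
def missing_chars(alphabet_chars, transcripts):
    """Return the set of transcript characters the alphabet does not cover."""
    allowed = set(alphabet_chars)
    seen = set()
    for line in transcripts:
        seen.update(line)
    return seen - allowed

# An accent-less a-z alphabet flags the accented vowels in Spanish text:
# missing_chars("abcdefghijklmnopqrstuvwxyz ", ["después evolucionó"])
# -> {'é', 'ó'}
```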
alemol
(Alejandro Molina)
September 5, 2019, 7:00pm
5
Are you just learning how to use DS or trying to build a usable model?
I am both learning how to use DS and then trying to build a usable model. I just want to be sure that I have everything necessary before training seriously.
I will add the --lm_trie_path, but it is easier if you tell me what else is missing. Thanks
carlfm01
6
Please read:
from __future__ import absolute_import, division, print_function
import os
import absl.flags
FLAGS = absl.flags.FLAGS
def create_flags():
    # Importer
    # ========
    f = absl.flags

    f.DEFINE_string('train_files', '', 'comma separated list of files specifying the dataset used for training. Multiple files will get merged. If empty, training will not be run.')
    f.DEFINE_string('dev_files', '', 'comma separated list of files specifying the dataset used for validation. Multiple files will get merged. If empty, validation will not be run.')
    f.DEFINE_string('test_files', '', 'comma separated list of files specifying the dataset used for testing. Multiple files will get merged. If empty, the model will not be tested.')
    f.DEFINE_string('feature_cache', '', 'path where cached features extracted from --train_files will be saved. If empty, caching will be done in memory and no files will be written.')
    f.DEFINE_integer('feature_win_len', 32, 'feature extraction audio window length in milliseconds')
This file has been truncated.
Here’s one of my old commands
./DeepSpeech.py \
--beam_width 1024 \
--train_files /yourpath/train.csv \
--dev_files /yourpath/dev.csv \
--test_files /yourpath/test.csv \
--train_batch_size 20 \
--dev_batch_size 48 \
--test_batch_size 48 \
--n_hidden 2048 \
--epochs 15 \
--report_count 900000 \
--earlystop_nsteps 1 \
--dropout_rate 0.11 \
--early_stop True \
--learning_rate 0.0001 \
--lm_alpha 0.75 \
--lm_beta 2.2 \
--export_dir /yourpath/result/models-tl \
--checkpoint_dir /yourpath/result/ckpts-tl2 \
--alphabet_config_path /yourpath/langmodel/alphabet.txt \
--lm_binary_path /yourpath/langmodel/lm.binary \
--lm_trie_path /yourpath/langmodel/trie
And sorry, if you need to build a production model you will need a good set of GPUs.