DeepSpeech Training Problems for Brazilian Portuguese

Hi,

I am trying to train a DeepSpeech model for Brazilian Portuguese.
Few datasets are available for Brazilian Portuguese (here is a work that used only 14 hours of speech).

I was able to get a 109-hour dataset in Brazilian Portuguese, and I am trying to train DeepSpeech on it. The dataset consists of spontaneous speech collected from sociolinguistic interviews and was completely transcribed by humans.

To create the LM and trie I followed the documentation's recommendations.
I created words.arpa with the following command (RawText.txt contains all the transcripts; the wav file paths have been removed from this file):

./lmplz --text ../../datasets/ASR-Portuguese-Corpus-V1/RawText.txt --arpa /tmp/words.arpa --order 5 --temp_prefix /tmp/
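As an aside, RawText.txt can be derived from the training CSV by keeping only the transcript column. A toy sketch in Python (the column names follow DeepSpeech's CSV format; the sample data is made up):

```python
import csv
import io

# Toy stand-in for metadata_train.csv; a real run would open the file instead.
sample = (
    "wav_filename,wav_filesize,transcript\n"
    "/data/a.wav,32044,olá mundo\n"
)

# Keep only the transcript column, which is what the LM should be trained on.
transcripts = [row["transcript"] for row in csv.DictReader(io.StringIO(sample))]
print("\n".join(transcripts))  # olá mundo
```

Using csv.DictReader (rather than a plain split on commas) keeps this robust if a transcript ever contains a quoted comma.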

Then I generated lm.binary from the words.arpa created above:
kenlm/build/bin/build_binary -a 255 -q 8 trie /tmp/words.arpa lm.binary

I installed the native client:
python util/taskcluster.py --arch gpu --target native_client --branch v0.6.0

I created the file alphabet.txt with the following content:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with # if you wish
# to use '#' as a label.

a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
ç
ã
à
á
â
ê
é
í
ó
ô
õ
ú
û
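
Both the LM training text and the acoustic-model transcripts should contain only characters that appear in this alphabet (note that DeepSpeech's stock alphabet.txt also contains a line holding a single space, which is easy to lose when copy-pasting). A minimal Python sanity check; the alphabet set below is typed out from the list above, and the helper names are my own:

```python
# Characters from the alphabet above, plus the space separator.
ALPHABET = set("abcdefghijklmnopqrstuvwxyzçãàáâêéíóôõúû ")

def find_invalid_chars(text: str) -> set:
    """Return the set of characters not covered by the alphabet."""
    return {ch for ch in text.lower() if ch not in ALPHABET and ch != "\n"}

def normalize(text: str) -> str:
    """Lowercase and drop characters outside the alphabet."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)

# Punctuation or stray uppercase in transcripts would surface here.
print(find_invalid_chars("Não, obrigado!"))  # {',', '!'}
print(normalize("Não, obrigado!"))           # não obrigado
```

Running find_invalid_chars over the whole of RawText.txt is a quick way to catch characters that would make training fail with an alphabet error.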

Then I generated the trie:
DeepSpeech/native_client/generate_trie ../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt lm.binary trie

Then I trained the model with the following command:

python DeepSpeech.py \
  --train_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_train.csv \
  --checkpoint_dir ../deepspeech_v6-0-0/checkpoints/ \
  --test_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv \
  --alphabet_config_path ../../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt \
  --lm_binary_path ../../datasets/deepspeech-data/lm.binary \
  --lm_trie_path ../../datasets/deepspeech-data/trie \
  --train_batch_size 2 \
  --test_batch_size 2 \
  --dev_batch_size 2 \
  --export_batch_size 2 \
  --epochs 200 \
  --early_stop False
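
For reference, DeepSpeech v0.6 expects the train/dev/test CSVs to have wav_filename, wav_filesize, and transcript columns. A small validation sketch over toy data (`check_rows` is a hypothetical helper, not part of DeepSpeech):

```python
import csv
import io

REQUIRED = ["wav_filename", "wav_filesize", "transcript"]

def check_rows(csv_text: str):
    """Yield (row_number, problem) for rows that would break training."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        yield (0, f"missing columns: {missing}")
        return
    for i, row in enumerate(reader, start=1):
        if not row["transcript"].strip():
            yield (i, "empty transcript")
        if not row["wav_filesize"].isdigit():
            yield (i, "wav_filesize is not an integer")

# Toy sample: the second row has a bad size and an empty transcript.
sample = (
    "wav_filename,wav_filesize,transcript\n"
    "/data/a.wav,32044,olá\n"
    "/data/b.wav,abc,\n"
)
for line_no, problem in check_rows(sample):
    print(line_no, problem)
```

Pointing this at the real metadata_train.csv (and additionally checking that each wav_filename exists on disk) rules out one common source of silent training failures.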

Previously I trained the model with early stopping (specifying dev_files), but training stopped after 4 epochs, so I removed the early stop. Both the 50-epoch and the 4-epoch models produce the same results.
I ran the test using the following command:

python evaluate.py \
  --checkpoint_dir ../deepspeech_v6-0-0/checkpoints/ \
  --test_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv \
  --alphabet_config_path ../../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt \
  --lm_binary_path  ../../datasets/deepspeech-data/lm.binary \
  --lm_trie_path ../../datasets/deepspeech-data/trie 

The result was:

INFO:tensorflow:Restoring parameters from ../deepspeech_v6-0-0/checkpoints/train-2796891
I0102 09:06:41.871738 139898013472576 saver.py:1280] Restoring parameters from ../deepspeech_v6-0-0/checkpoints/train-2796891
I Restored variables from most recent checkpoint at ../deepspeech_v6-0-0/checkpoints/train-2796891, step 2796891
Testing model on ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00                                                                                                                                                  2020-01-02 09:06:42.344339: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2020-01-02 09:06:42.384953: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-01-02 09:06:42.537285: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
Test epoch | Steps: 199 | Elapsed Time: 0:01:28                                                                                                                                                
Test on ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv - WER: 0.956973, CER: 0.852231, loss: 101.685509
--------------------------------------------------------------------------------
WER: 4.000000, CER: 2.333333, loss: 54.671597
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/53999_nurc_.wav
 - src: "lá "
 - res: "e a e a "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.666667, loss: 32.827530
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/17216_nurc_.wav
 - src: "revistas "
 - res: "e a "
--------------------------------------------------------------------------------
WER: 1.200000, CER: 0.739130, loss: 79.709518
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/60600_nurc_.wav
 - src: "num não me animo muito "
 - res: "e a a a a a "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 8.319281
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/33267_sp_.wav
 - src: "é "
 - res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 11.219957
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/37622_sp_.wav
 - src: "né "
 - res: "e"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 11.632010
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/29378_nurc_.wav
 - src: "é "
 - res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 12.242241
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/37172_nurc_.wav
 - src: "é "
 - res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 13.220651
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/62827_sp_.wav
 - src: "não "
 - res: "e"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.750000, loss: 14.941595
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/844_nurc_.wav
 - src: "mas "
 - res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.750000, loss: 14.989404
 - wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/22739_sp_.wav
 - src: "uhn "
 - res: "e "
--------------------------------------------------------------------------------
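
As a side note, the per-sample WER printed here is word-level edit distance divided by the number of reference words, which is why a one-word reference can score a WER above 1.0. A minimal sketch that reproduces the figures above:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

# First sample above: src "lá", res "e a e a" -> 1 substitution + 3 insertions.
print(wer("lá", "e a e a"))  # 4.0
```

The same function gives 1.2 for the "num não me animo muito" sample, matching the log.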

The model very often transcribes just the letter "e"; this letter is very frequent in the dataset.

Am I doing something wrong?

How can I check whether my lm.binary and trie are correct?

Does anyone have any suggestions?

Best Regards,

This looks like an obvious case of "the model has learnt nothing". The fact that early stopping triggered so soon is also a hint.

You likely have to adapt hyper-parameters to your dataset.

In particular, with so little data I would start by reducing n_hidden dramatically. Try 1024, 768, or 512.

For the language model, the OSCAR dataset has 64GB of Portuguese text: https://traces1.inria.fr/oscar/

Thanks so much for your reply :). I will recreate the language model soon.

For now, I mapped each accented Portuguese character to an apostrophe followed by the base letter (for example, ç → 'c and á → 'a). With that, I was able to use transfer learning from the pre-trained English model; I'm at epoch 27 with early stopping enabled and the loss is still decreasing. I had already done something similar in voice synthesis, where models that did not converge on a small Portuguese dataset began to converge. When training is over I will post the result.
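
The mapping described above can be sketched as below. The exact table is my assumption, and note that folding several accents onto the same marker (á/à/â/ã all to 'a) is lossy; a scheme with distinct markers per accent would be reversible:

```python
# Map accented characters to apostrophe + base letter so Portuguese
# transcripts fit the English alphabet plus "'". Table is an example.
ACCENT_MAP = {
    "ç": "'c", "ã": "'a", "à": "'a", "á": "'a", "â": "'a",
    "ê": "'e", "é": "'e", "í": "'i",
    "ó": "'o", "ô": "'o", "õ": "'o", "ú": "'u", "û": "'u",
}

def fold_accents(text: str) -> str:
    """Replace each accented character with its apostrophe-prefixed base letter."""
    return "".join(ACCENT_MAP.get(ch, ch) for ch in text)

print(fold_accents("não é fácil"))  # n'ao 'e f'acil
```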

Nice, that’s really cool! We’ve attempted some transfer learning experiments with the English model before, but nothing worked very well with only around a hundred hours of data. If this works for you, it’d be great to know.

Hi, with transfer learning I didn’t get a good result (Test on ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200-tl.csv - WER: 0.749439, CER: 0.467458, loss: 60.035717), but it was better than the previous one. The model hardly learns accents, which is a difficult task, since an ’ in front of a letter stands for a different, accented letter. The model still predicts most transcripts wrongly.

In the next version of DeepSpeech, could you train the English model with an alphabet that includes all the accented characters of Portuguese and Spanish?

I believe this would help transfer learning for these languages.

@edresson1, how have you been doing with your training?