How to train on the Common Voice Portuguese dataset - pt_29h_2019-12-10

Jerson_Luiz_de_Paula_Junior · May 31, 2020, 4:20am

Is it now possible to train on the Portuguese dataset using DeepSpeech?

I read the documentation at https://deepspeech.readthedocs.io/en/v0.7.1/TRAINING.html, but I am confused about how to train a model using the examples in the /bin folder. Could someone help me get started?

For the Portuguese dataset, I have already run the commands below, following the documentation:

bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/language/archive

python3 DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv
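For Portuguese, the alphabet file passed to --filter_alphabet needs to cover accented characters such as ã, ç and é. A minimal sketch of deriving one from an imported CSV follows. It assumes the CSV has a `transcript` column, as the files written by bin/import_cv2.py do, that the importer was first run once without the filter, and that the alphabet file lists one character per line. The function names and sample rows are hypothetical.

```python
import csv
import io


def collect_alphabet(csv_file):
    """Return the sorted list of characters used in the `transcript` column."""
    chars = set()
    for row in csv.DictReader(csv_file):
        chars.update(row["transcript"])
    return sorted(chars)


def format_alphabet(chars):
    """Render one character per line (assumed alphabet.txt layout)."""
    return "\n".join(chars) + "\n"


# Tiny in-memory stand-in for a real train.csv (hypothetical rows).
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,abra a janela\n"
    "b.wav,1200,não é\n"
)
alphabet = collect_alphabet(sample)
print(format_alphabet(alphabet), end="")
```

With a real dataset, the same function could be run over train.csv, dev.csv and test.csv and the results merged, so that every character in the transcripts is covered. Note that the space character ends up on a line of its own, so trailing whitespace in the file matters.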

othiele · May 31, 2020, 8:09am

Yes, it works, but without any description of what went wrong I have no idea how to help you.

Jerson_Luiz_de_Paula_Junior · June 2, 2020, 1:01pm

After training, the results have nothing to do with Portuguese; the output looks like English. What did I do wrong?

Test on br/clips/test.csv - WER: 1.000000, CER: 0.699375, loss: 89.459923
--------------------------------------------------------------------------------
Best WER: 
--------------------------------------------------------------------------------
WER: 0.666667, CER: 0.615385, loss: 32.993889
 - wav: file://br/clips/common_voice_pt_19301257.wav
 - src: "daqui a pouco"
 - res: "one a too"
--------------------------------------------------------------------------------
WER: 0.666667, CER: 0.700000, loss: 32.972122
 - wav: file://br/clips/common_voice_pt_19287125.wav
 - src: "chanceler do tesouro"
 - res: "i do so"
--------------------------------------------------------------------------------
WER: 0.666667, CER: 0.692308, loss: 22.815779
 - wav: file://br/clips/common_voice_pt_19310535.wav
 - src: "abra a janela"
 - res: "a ran"
--------------------------------------------------------------------------------
WER: 0.750000, CER: 0.653846, loss: 90.590782
 - wav: file://br/clips/common_voice_pt_19334198.wav
 - src: "as praias eram exuberantes"
 - res: "as to er er "

My configuration:
python3 DeepSpeech.py --train_files br/clips/train.csv --dev_files br/clips/dev.csv --test_files br/clips/test.csv --train_batch_size 10 --dev_batch_size 10 --test_batch_size 10 --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.20 --epochs 75 --lm_alpha 0.75 --lm_beta 1.85 --export_dir export/ --checkpoint_dir export/ --export_language pt --alphabet_config_path alphabet.txt --scorer data/lm/kenlm.scorer
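The WER and CER values in the test report above can be checked by hand. A minimal sketch, assuming both are the Levenshtein edit distance divided by the length of the reference (counted in words for WER, in characters for CER). This reproduces the figures in the log but is not DeepSpeech's own code, and the function names are hypothetical.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edit distance over reference word count."""
    words = src.split()
    return levenshtein(words, res.split()) / len(words)


def cer(src, res):
    """Character error rate: character-level edit distance over reference length."""
    return levenshtein(src, res) / len(src)


src, res = "daqui a pouco", "one a too"
print(f"WER: {wer(src, res):.6f}, CER: {cer(src, res):.6f}")
# prints: WER: 0.666667, CER: 0.615385
```

In this sample only the word "a" survives, so two of the three reference words count as errors.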

othiele · June 2, 2020, 1:17pm

Ah, you need a custom scorer built from Portuguese text, not the English one. Search the forum for how to build a custom scorer.

Jerson_Luiz_de_Paula_Junior · June 2, 2020, 3:31pm

I could not find how to create a custom scorer on the forum. The documentation at https://deepspeech.readthedocs.io/en/master/Scorer.html explains how to create one, but I still have some questions.

Is it possible to create one from a word list, or from the transcripts in the TSV files of the Common Voice dataset?
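One way to get Portuguese text for a language model is to pull the transcripts out of the Common Voice TSV files into a plain text file, one sentence per line. A minimal sketch, assuming the TSV has a `sentence` column, as Common Voice releases do, and that lowercased, whitespace-normalized text is acceptable input. The function name and sample rows are hypothetical, and punctuation handling is left out; any normalization should match the alphabet used for training.

```python
import csv
import io


def extract_sentences(tsv_file):
    """Return unique, lowercased sentences from a Common Voice TSV, in file order."""
    seen = set()
    sentences = []
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Lowercase and collapse runs of whitespace into single spaces.
        sentence = " ".join(row["sentence"].lower().split())
        if sentence and sentence not in seen:
            seen.add(sentence)
            sentences.append(sentence)
    return sentences


# Tiny in-memory stand-in for a real validated.tsv (hypothetical rows).
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "x1\tcommon_voice_pt_1.mp3\tAbra a janela\n"
    "x2\tcommon_voice_pt_2.mp3\tDaqui a  pouco\n"
    "x3\tcommon_voice_pt_3.mp3\tabra a janela\n"
)
corpus = extract_sentences(sample)
print("\n".join(corpus))
```

The 29 hours of Common Voice transcripts are a small corpus, so a scorer built only from them will likely know few words; a larger Portuguese text source may give better results.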

othiele · June 2, 2020, 6:20pm

Really? This was the first result of many I found:

Jerson_Luiz_de_Paula_Junior · June 2, 2020, 9:36pm

I have another problem now. Any tips?

This command works fine:

lmplz --order 2 --text vocabulary.txt --arpa words.arpa

But this one fails with an error:

> /opt/DeepSpeech/data/lm# python3 generate_package.py --alphabet /opt/DeepSpeech/data/alphabet.txt --lm lm.binary --vocab palavras.txt --package kenlm.scorer --default_alpha 0.75 --default_beta 1.18

320037 unique words read from vocabulary file.
    Doesn't look like a character based model.
    Using detected UTF-8 mode: False
    Traceback (most recent call last):
      File "generate_package.py", line 153, in <module>
        main()
      File "generate_package.py", line 148, in main
        args.default_beta,
      File "generate_package.py", line 58, in create_bundle
        if err != ds_ctcdecoder.DS_ERR_SCORER_NO_TRIE:
    AttributeError: module 'ds_ctcdecoder' has no attribute 'DS_ERR_SCORER_NO_TRIE'
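The traceback shows that generate_package.py expects a constant that the installed ds_ctcdecoder module does not provide. One possible explanation, not confirmed in this thread, is that the script and the installed package come from different DeepSpeech versions. A small sketch for checking what the installed module actually exposes; the helper name is hypothetical, it uses only the Python standard library, and the `__version__` lookup is an assumption that falls back to None if the module does not define it.

```python
import importlib


def describe_module(module_name, attribute):
    """Report whether a module imports, its __version__ (if any), and whether it has an attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return {"importable": False, "version": None, "has_attribute": False}
    return {
        "importable": True,
        # Assumption: the module may or may not define __version__.
        "version": getattr(module, "__version__", None),
        "has_attribute": hasattr(module, attribute),
    }


print(describe_module("ds_ctcdecoder", "DS_ERR_SCORER_NO_TRIE"))
```

If the module is importable but `has_attribute` is False, comparing the reported version with the version of the DeepSpeech checkout that generate_package.py came from would be a reasonable next step.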

othiele · June 3, 2020, 7:40am

Really, please learn how to use the search in this forum. This is the second question that could have been answered by a one-minute search.