Is it now possible to train on the Portuguese dataset using DeepSpeech?
I read the documentation at https://deepspeech.readthedocs.io/en/v0.7.1/TRAINING.html but I am confused about how to train a model using the examples in the /bin folder. Could someone give me some initial help?
Using the dataset in Portuguese:
I have already executed the commands below, according to the documentation:
bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/language/archive
python3 DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv
othiele
(Olaf Thiele)
May 31, 2020, 8:09am
2
Yes, it works, but without any description of what went wrong I have no idea how to help you.
After training, the results look like English rather than Portuguese. What did I do wrong?
Test on br/clips/test.csv - WER: 1.000000, CER: 0.699375, loss: 89.459923
--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 0.666667, CER: 0.615385, loss: 32.993889
- wav: file://br/clips/common_voice_pt_19301257.wav
- src: "daqui a pouco"
**- res: "one a too"**
--------------------------------------------------------------------------------
WER: 0.666667, CER: 0.700000, loss: 32.972122
- wav: file://br/clips/common_voice_pt_19287125.wav
- src: "chanceler do tesouro"
**- res: "i do so"**
--------------------------------------------------------------------------------
WER: 0.666667, CER: 0.692308, loss: 22.815779
- wav: file://br/clips/common_voice_pt_19310535.wav
- src: "abra a janela"
**- res: "a ran"**
--------------------------------------------------------------------------------
WER: 0.750000, CER: 0.653846, loss: 90.590782
- wav: file://br/clips/common_voice_pt_19334198.wav
- src: "as praias eram exuberantes"
- res: "as to er er "
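For reference, the WER values in the report above are consistent with a word-level edit distance divided by the number of words in the reference transcript. The sketch below is not DeepSpeech's own evaluation code: `edit_distance` and `word_error_rate` are hypothetical helper names, and splitting on whitespace is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(src, res):
    """Word edit distance divided by the number of reference words.

    Assumes `src` contains at least one word.
    """
    ref_words = src.split()
    hyp_words = res.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# The (src, res) pairs are copied from the test report above.
samples = [
    ("daqui a pouco", "one a too"),
    ("chanceler do tesouro", "i do so"),
    ("abra a janela", "a ran"),
    ("as praias eram exuberantes", "as to er er "),
]

for src, res in samples:
    print(f"{src!r} -> {res!r}: WER {word_error_rate(src, res):.6f}")
```

On these four samples the sketch gives 0.666667, 0.666667, 0.666667 and 0.750000, the same WER figures as in the report. The decoded words are English, so nearly every word counts as an error.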
My configuration:
python3 DeepSpeech.py --train_files br/clips/train.csv --dev_files br/clips/dev.csv --test_files br/clips/test.csv --train_batch_size 10 --dev_batch_size 10 --test_batch_size 10 --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.20 --epochs 75 --lm_alpha 0.75 --lm_beta 1.85 --export_dir export/ --checkpoint_dir export/ --export_language pt --alphabet_config_path alphabet.txt --scorer data/lm/kenlm.scorer
othiele
(Olaf Thiele)
June 2, 2020, 1:17pm
4
Ah, you need a custom scorer built from Portuguese text, not the one for English. Search the forum for how to build a custom scorer.
I did not find on the forum how to create a custom scorer. The documentation at https://deepspeech.readthedocs.io/en/master/Scorer.html explains how to create one, but I still have some doubts.
Is it possible to create one from a vocabulary of words, or from the transcripts in the TSV files of the Common Voice dataset?
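To make the TSV idea concrete, here is a minimal sketch of how the transcripts could be pulled out of a Common Voice TSV into a plain text file, one sentence per line, which is the kind of input a language model tool such as lmplz reads. It assumes the TSV has a `sentence` column; the function name `tsv_sentences_to_text`, the lowercasing step, and the file names in the usage comment are assumptions for illustration, not part of DeepSpeech.

```python
import csv


def tsv_sentences_to_text(tsv_path, out_path):
    """Copy the `sentence` column of a TSV to out_path, one sentence per line.

    Returns the number of sentences written. Empty sentences are skipped.
    """
    count = 0
    with open(tsv_path, encoding="utf-8", newline="") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Lowercasing is an assumption; match it to your alphabet.txt.
            sentence = (row.get("sentence") or "").strip().lower()
            if sentence:
                dst.write(sentence + "\n")
                count += 1
    return count


# Hypothetical usage (file names are placeholders):
# tsv_sentences_to_text("validated.tsv", "vocabulary.txt")
```

The transcripts alone are a fairly small corpus, so a larger body of Portuguese text would probably give a more useful language model.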
othiele
(Olaf Thiele)
June 2, 2020, 6:20pm
6
Really? This was the first result of many I found:
Hi! I am using DeepSpeech 0.7.0 alpha2. I also get the error when I try to run generate_package.py
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
I don’t understand what that -v argument you are talking about is, or where I should place it, or what the problem is.
I performed the following steps so far:
path/lmplz --…
I have another problem now. Any tips?
This command works fine:
lmplz --order 2 --text vocabulary.txt --arpa words.arpa
Error:
> /opt/DeepSpeech/data/lm# python3 generate_package.py --alphabet /opt/DeepSpeech/data/alphabet.txt --lm lm.binary --vocab palavras.txt --package kenlm.scorer --default_alpha 0.75 --default_beta 1.18
320037 unique words read from vocabulary file.
Doesn't look like a character based model.
Using detected UTF-8 mode: False
Traceback (most recent call last):
  File "generate_package.py", line 153, in <module>
    main()
  File "generate_package.py", line 148, in main
    args.default_beta,
  File "generate_package.py", line 58, in create_bundle
    if err != ds_ctcdecoder.DS_ERR_SCORER_NO_TRIE:
AttributeError: module 'ds_ctcdecoder' has no attribute 'DS_ERR_SCORER_NO_TRIE'
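The AttributeError suggests that the installed ds_ctcdecoder package may not match the version that this copy of generate_package.py expects. That is only an inference from the traceback; nothing in this thread confirms it. A minimal sketch to check what the installed module exposes, where `module_exposes` is a hypothetical helper name:

```python
import importlib


def module_exposes(module_name, attr):
    """Return True/False if the module has `attr`, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr)


# The module and attribute names are taken from the traceback above.
print(module_exposes("ds_ctcdecoder", "DS_ERR_SCORER_NO_TRIE"))
```

If this prints False, the installed decoder package lacks the constant the script refers to, which would fit a mismatch between the script and the package.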
othiele
(Olaf Thiele)
June 3, 2020, 7:40am
8
Really, please learn how to use the search in this forum. This is the second question that is answered by a one minute search.