Everything works fine with the given default configs (i.e. training, prediction)… It only breaks when I update vocabulary.txt.
So verify / share how you build it. Please look at data/lm, it should be self-contained and drive you to a working LM.
Steps I followed to generate the LM:
- Get alphabet.txt and add custom words.
../kenlm/build/bin/lmplz --discount_fallback -o 3 < mirrorfit.txt > mirrorfit.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/rbeigcn1134841d/Desktop/mark1/mfit-models/mirrorfit.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 200003 types 200006
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:2400072 2:9327010816 3:17488144384
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 1: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 2: D1=0.5 D2=1 D3+=1.5
Statistics:
1 200006 D1=0.5 D2=1 D3+=1.5
2 400006 D1=0.5 D2=1 D3+=1.5
3 200003 D1=0.5 D2=1 D3+=1.5
Memory estimate for binary LM:
type kB
probing 17969 assuming -p 1.5
probing 21094 assuming -r models -p 1.5
trie 10718 without quantization
trie 7864 assuming -q 8 -b 8 quantization
trie 10132 assuming -a 22 array pointer compression
trie 7278 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:2400072 2:6400096 3:4000060
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:2400072 2:6400096 3:4000060
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:26364176 kB VmRSS:22528 kB RSSMax:6075308 kB user:0.858606 sys:1.26977 CPU:2.12839 real:2.06948
../kenlm/build/bin/build_binary -T -s mirrorfit.arpa mirrorfit.binary
Reading mirrorfit.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
- …/DeepSpeech/generate_trie alphabet.txt mirrorfit.binary trie
Are the flow and outputs correct?
It does not look like you are passing the correct arguments to build_binary.
Please reply and provide the feedback on the other items I asked you to check.
I was following the tutorial "How I trained a specific french model to control my robot".
Creating the binary file:
/bin/bin/./build_binary -T -s words.arpa lm.binary
He built it the same way; if that is wrong, please tell me which params to pass.
Also, if I use the lm.binary from the default package with my trie, it core-dumps. But if I use my binaries with the trie given in the default package, it works. Not sure why.
Because you keep insisting on:
- not listening to what I am telling you
- refusing to give us feedback on updating the ds_ctcdecoder package
- not using the proper documentation that I already linked you to.
I will stop helping you until you actually read and act on what I asked earlier.
I am really sorry, I forgot to mention that I did run
pip install --upgrade $(python util/taskcluster.py --decoder)
but the issue still persists.
I keep referring to data/lm as well as the tutorial "How I trained a specific french model to control my robot" to generate the language model. Maybe I am missing something small, I am just not able to spot it.
Wait, can we avoid confusion and get the whole picture ? It’s completely unclear what you are doing now.
Can you cross-check and share pip list | grep ds_ctcdecoder as well as git describe --tags?
Do you have the crash with the default language model / trie ? Since you failed to share proper status at first, I assumed you had a mismatch …
Please, read doc and script. Don’t refer to anything else.
pip list | grep ds_ctcdecoder
ds-ctcdecoder 0.6.1
git describe --tags
v0.6.1-35-g94882fb
Yes, the default lm.binary and trie are working perfectly fine.
OK, I will check the generate_lm script and read the docs.
Weird. If you are on v0.6.1, you should not have that tag. This shows you are on master, so you’re going to have troubles if you don’t stick to matching versions.
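The mismatch described above can be spotted mechanically. The following is a minimal sketch, not part of DeepSpeech: the helper names `parse_git_describe` and `versions_match` are hypothetical, and it assumes the usual `git describe --tags` layout of `<tag>-<commits ahead>-g<short sha>` when the checkout is past a tag.

```python
import re


def parse_git_describe(describe):
    """Split `git describe --tags` output into (tag, commits_ahead, short_sha).

    An exact tag checkout such as "v0.6.1" yields (tag, 0, None).
    """
    match = re.fullmatch(
        r"(?P<tag>.+?)(?:-(?P<ahead>\d+)-g(?P<sha>[0-9a-f]+))?",
        describe.strip(),
    )
    if match is None:
        raise ValueError("unrecognised describe output: %r" % describe)
    ahead = int(match.group("ahead")) if match.group("ahead") else 0
    return match.group("tag"), ahead, match.group("sha")


def versions_match(describe, package_version):
    """True only if the checkout sits exactly on the tag of the installed package."""
    tag, ahead, _ = parse_git_describe(describe)
    # Assumption: release tags are the package version prefixed with "v".
    return ahead == 0 and tag.lstrip("v") == package_version


if __name__ == "__main__":
    # Values taken from the outputs posted in this thread.
    print(versions_match("v0.6.1-35-g94882fb", "0.6.1"))
    print(versions_match("v0.6.1", "0.6.1"))
```

With the values posted in this thread, the first call reports a mismatch (35 commits past the tag) and the second reports a match.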
I will check out that tag and let you know.
Now I have matching versions:
pip list | grep ds_ctcdecoder
ds-ctcdecoder 0.6.1
git describe --tags
v0.6.1
Even now, after generating the LM, I am getting this error while training from the checkpoint with my added vocabulary and my LM binary and trie.
Command:
python3 DeepSpeech.py \
--train_files /home/Downloads/indian_train.csv \
--dev_files /home/Downloads/indian_dev.csv \
--test_files /home/Downloads/indian_test.csv \
--n_hidden 2048 \
--train_batch_size 20 \
--dev_batch_size 10 \
--test_batch_size 10 \
--epochs 1 \
--learning_rate 0.0001 \
--export_dir /home/Desktop/mark3/trieModel/ \
--checkpoint_dir /home/Desktop/mark3/DeepSpeech/deepspeech-0.6.1-checkpoint/ \
--cudnn_checkpoint /home/Desktop/mark3/DeepSpeech/deepspeech-0.6.1-checkpoint/ \
--alphabet_config_path /home/Desktop/mark3/mfit-models/alphabet.txt \
--lm_binary_path /home/Desktop/mark3/mfit-models/lm.binary \
--lm_trie_path /home/Desktop/mark3/mfit-models/trie
The error occurs after the dev phase finishes, while it is evaluating test.csv:
I Restored variables from best validation checkpoint at /home/Desktop/mark3/DeepSpeech/deepspeech-0.6.1-checkpoint/best_dev-234353, step 234353
Testing model on /home/Downloads/indian_test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00 Fatal Python error: Fatal Python error: Fatal Python error: Fatal Python error: Segmentation faultSegmentation faultSegmentation fault
Segmentation faultThread 0x
Segmentation fault (core dumped)
Something is broken in your CTC decoder setup / trie production…
Can you please tell us exactly how you proceed? I’m really starting to lose patience here.
What are the sizes of:
- your vocabulary.txt file
- your lm.binary file
- your trie file
Can you ensure you used exactly the same alphabet file ?
Maybe there’s something bogus in your dataset.
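One way to check the alphabet question is to list every character in the vocabulary text that the alphabet does not cover. This is only a sketch with hypothetical helper names (`load_alphabet`, `missing_characters`). It assumes the alphabet file holds one character per line, that lines starting with `#` are comments, and that a literal `#` is written as `\#`; please confirm those rules against the alphabet.txt shipped with your checkout.

```python
def load_alphabet(lines):
    """Collect the characters declared in an alphabet file (format assumed, see above)."""
    chars = set()
    for line in lines:
        line = line.rstrip("\r\n")  # keep a line holding a single space
        if line.startswith("#"):
            continue  # assumed comment line
        if line.startswith("\\#"):
            line = "#"  # assumed escape for a literal hash
        if line:
            chars.add(line)
    return chars


def missing_characters(alphabet, text):
    """Return the sorted characters of `text` that are not in `alphabet`."""
    return sorted({ch for ch in text if ch != "\n" and ch not in alphabet})


if __name__ == "__main__":
    # Paths are placeholders; point them at your own files.
    with open("alphabet.txt", encoding="utf-8") as f:
        alphabet = load_alphabet(f)
    with open("vocabulary.txt", encoding="utf-8") as f:
        print(missing_characters(alphabet, f.read()))
```

An empty list means every character of the vocabulary is covered; anything else is worth cleaning from the text or checking against the alphabet used for the checkpoint.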
vocabulary.txt = 1.7MB
lm.binary = 20.1MB
trie = 80 Bytes
So you failed at generating the trie file. Since you have not yet shared how you do that, we can’t help you …
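A size check like the one requested above can be scripted so that an empty or near-empty artifact is caught before training. This is a minimal sketch: the function name `suspicious_artifacts` is hypothetical and the 1024-byte threshold is an arbitrary assumption, not a documented limit.

```python
import os


def suspicious_artifacts(paths, min_bytes=1024):
    """Return {path: size} for files that are missing (size None) or under min_bytes."""
    report = {}
    for path in paths:
        size = os.path.getsize(path) if os.path.isfile(path) else None
        if size is None or size < min_bytes:
            report[path] = size
    return report


if __name__ == "__main__":
    # File names as used in this thread; adjust the paths to your setup.
    for path, size in suspicious_artifacts(["vocabulary.txt", "lm.binary", "trie"]).items():
        print(path, "is missing" if size is None else "is only %d bytes" % size)
```

With the sizes reported in this thread, only the 80-byte trie would be flagged.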