Hi y'all,
I'm getting a Fatal Python error: Segmentation fault when I try to load a custom lm.scorer with a 0.7.2 model. I'd appreciate any help debugging the way I generate the custom scorer.
Just to prove the point, here's everything working in one_shot_infer mode with the kenlm.scorer:
docker run --rm -it --runtime=nvidia -u 1000:1000 -v $(pwd)/../:/work/waha-tuhi -w /work/waha-tuhi/train 473856431958.dkr.ecr.ap-southeast-2.amazonaws.com/waha-tuhi/train-gpu:latest python3 -u DeepSpeech/DeepSpeech.py \
--scorer_path 'kenlm.scorer' \
--alphabet_config_path '../data/lm/base_encoder/alphabet.txt' \
--checkpoint_dir /work/waha-tuhi/models/20200603_ds.0.7.1_thm/checkpoints \
--summary_dir /work/waha-tuhi/models/20200603_ds.0.7.1_thm/summaries \
--one_shot_infer 'test.wav'
I Loading best validating checkpoint from /work/waha-tuhi/models/20200603_ds.0.7.1_thm/checkpoints/best_dev-29322
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
**ka ohia kaoha te hai me na papoa o taou**
–
But when I use my own custom lm.scorer, I get a segmentation fault:
docker run --rm -it --runtime=nvidia -u 1000:1000 -v $(pwd)/../:/work/waha-tuhi -w /work/waha-tuhi/train 473856431958.dkr.ecr.ap-southeast-2.amazonaws.com/waha-tuhi/train-gpu:latest python3 -u DeepSpeech/DeepSpeech.py \
--scorer_path '../data/lm/base_encoder/lm.scorer' \
--alphabet_config_path '../data/lm/base_encoder/alphabet.txt' \
--checkpoint_dir /work/waha-tuhi/models/20200603_ds.0.7.1_thm/checkpoints \
--summary_dir /work/waha-tuhi/models/20200603_ds.0.7.1_thm/summaries \
--one_shot_infer 'test.wav'
I Loading best validating checkpoint from /work/waha-tuhi/models/20200603_ds.0.7.1_thm/checkpoints/best_dev-29322
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Fatal Python error: Segmentation fault
Current thread 0x00007f8b75afc740 (most recent call first):
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/swigwrapper.py", line 361 in ctc_beam_search_decoder
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 83 in ctc_beam_search_decoder
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 882 in do_single_file_inference
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 934 in main
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250 in _run_main
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299 in run
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 939 in run_script
File "DeepSpeech/DeepSpeech.py", line 12 in <module>
makefile:252: recipe for target 'test_infer' failed
make: *** [test_infer] Error 139
So… yeah. All the same files are used as input, except the lm.scorer, so clearly I messed up the way I generated lm.scorer.
But how can I validate/debug the way I created lm.scorer?
The way that I generated the lm.scorer was a two-step process, specifically:
Step 1: generate_lm.py
(Note: I added a 'skip_symbols' flag which just passes through to lmplz; the change is sketched below.)
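This is roughly it, paraphrased from memory rather than the exact diff ('lmplz_args' here stands for whatever the script calls the list of lmplz arguments it builds):

# sketch of my generate_lm.py change, not the exact diff
parser.add_argument(
    "--skip_symbols",
    action="store_true",
    help="Pass lmplz's --skip_symbols flag so <s>, </s> and <unk> in the "
         "input text are treated as whitespace instead of raising an error.",
)
# ... later, where the lmplz command line is assembled:
if args.skip_symbols:
    lmplz_args.append("--skip_symbols")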
docker run --rm -u root:root -v $(pwd)/../:/work/waha-tuhi -w /work/waha-tuhi/lm 473856431958.dkr.ecr.ap-southeast-2.amazonaws.com/waha-tuhi/lm:latest bash -c "\
python ../train/DeepSpeech/data/lm/generate_lm.py \
--input_txt ../data/lm_data/base_encoder/mi.languagemodel.train \
--output_dir ../data/lm/base_encoder \
--top_k 500000 --kenlm_bins /usr/lib/kenlm/build/bin \
--arpa_order 5 --max_arpa_memory 85% \
--arpa_prune \"0|0|1\" \
--binary_a_bits 22 \
--binary_q_bits 8 \
--binary_type trie \
--skip_symbols"
(progress bar output trimmed: the line counter reached 172986 after about 4 seconds)
=== 1/5 Counting and sorting n-grams ===
Reading /work/waha-tuhi/data/lm/base_encoder/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Warning: <s> appears in the input. All instances of <s>, </s>, and <unk> will be interpreted as whitespace.
****************************************************************************************************
Unigram tokens 2978778 types 29560
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:354720 2:5339543040 3:10011643904 4:16018628608 5:23360501760
Statistics:
1 29560 D1=0.639748 D2=1.02063 D3+=1.43701
2 307654 D1=0.700202 D2=1.07513 D3+=1.42814
3 256739/1010772 D1=0.795858 D2=1.13648 D3+=1.42481
4 243294/1706845 D1=0.875455 D2=1.22982 D3+=1.43663
5 158645/2060538 D1=0.908299 D2=1.26413 D3+=1.37111
Memory estimate for binary LM:
type kB
probing 22469 assuming -p 1.5
probing 27317 assuming -r models -p 1.5
trie 11048 without quantization
trie 5976 assuming -q 8 -b 8 quantization
trie 9927 assuming -a 22 array pointer compression
trie 4855 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:354720 2:4922464 3:5134780 4:5839056 5:4442060
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:354720 2:4922464 3:5134780 4:5839056 5:4442060
=== 5/5 Writing ARPA model ===
Name:lmplz VmPeak:53646748 kB VmRSS:22456 kB RSSMax:9517180 kB user:4.444 sys:5.044 CPU:9.49127 real:8.12519
Reading ../data/lm/base_encoder/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Reading ../data/lm/base_encoder/lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
Converting to lowercase and counting word occurrences ...
Saving top 500000 words ...
Calculating word statistics ...
Your text file has 3324752 words in total
It has 29559 unique words
Your top-500000 words are 100.0000 percent of all words
Your most common word "te" occurred 292223 times
The least common word in your top-k is "kātoa" with 1 times
The first word with 2 occurrences is "uke" at place 16509
Creating ARPA file ...
Filtering ARPA file using vocabulary of top-k words ...
Building lm.binary ...
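(One sanity check I can run at this point is to query the intermediate lm.binary directly with the kenlm Python bindings; a rough sketch, assuming the kenlm module is importable inside the lm image:)

import kenlm

# load the quantized trie binary produced by build_binary
model = kenlm.Model("../data/lm/base_encoder/lm.binary")
print(model.order)  # should report 5 for a 5-gram model

# score a short sentence in the target language; an exception here
# would point at a broken binary rather than at the packaging step
print(model.score("ka ohia kaoha te hai me na papoa o taou", bos=True, eos=True))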
The second step is making the lm.scorer.
Step 2: Make lm.scorer
docker run --rm -u root:root -v $(pwd)/../:/work/waha-tuhi -w /work/waha-tuhi/lm 473856431958.dkr.ecr.ap-southeast-2.amazonaws.com/waha-tuhi/lm:latest bash -c "\
python ../train/DeepSpeech/data/lm/generate_package.py \
--alphabet ../data/lm/base_encoder/alphabet.txt \
--lm ../data/lm/base_encoder/lm.binary \
--vocab ../data/lm/base_encoder/mi_vocab.txt \
--package ../data/lm/base_encoder/lm.scorer \
--default_alpha 1.47 \
--default_beta 3.49 \
--force_utf8=true"
16508 unique words read from vocabulary file.
Doesn't look like a character based model.
Package created in ../data/lm/base_encoder/lm.scorer
Clearly the lm.scorer is no dang good, but:
- can anyone see what I might have done wrong?
- is there a way to validate or debug the process of creating it?
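For what it's worth, the only standalone check I could think of is trying to load the scorer directly with ds_ctcdecoder inside the training image, something like the rough sketch below (I'm guessing the Scorer/Alphabet constructor arguments from the 0.7.x source, so they may need adjusting), but maybe there's a better way?

# rough sketch: try to load the custom scorer outside of DeepSpeech.py
from deepspeech_training.util.text import Alphabet
from ds_ctcdecoder import Scorer

alphabet = Alphabet("../data/lm/base_encoder/alphabet.txt")
# alpha/beta are just the defaults I packaged with
scorer = Scorer(1.47, 3.49, "../data/lm/base_encoder/lm.scorer", alphabet)
print("scorer loaded without crashing")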