Hi y'all,
I'm getting a Fatal Python error: Segmentation fault when I try to load a custom lm.scorer with a 0.7.2 model. I'd appreciate any help debugging the way I generate the custom scorer.
Just to prove the point, here's everything working in one_shot_infer mode with the kenlm.scorer:
docker run --rm -it --runtime=nvidia -u 1000:1000 -v $(pwd)/../:/work/waha-tuhi -w /work/waha-tuhi/train 473856431958.dkr.ecr.ap-southeast-2.amazonaws.com/waha-tuhi/train-gpu:latest python3 -u DeepSpeech/DeepSpeech.py \
--scorer_path 'kenlm.scorer' \
--alphabet_config_path '../data/lm/base_encoder/alphabet.txt' \
--checkpoint_dir /work/waha-tuhi/models/20200603_ds.0.7.1_thm/checkpoints \
--summary_dir /work/waha-tuhi/models/20200603_ds.0.7.1_thm/summaries \
--one_shot_infer 'test.wav'
I Loading best validating checkpoint from /work/waha-tuhi/models/20200603_ds.0.7.1_thm/checkpoints/best_dev-29322
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
**ka ohia kaoha te hai me na papoa o taou**
–
But when I use my own custom lm.scorer, I get a segmentation fault:
docker run --rm -it --runtime=nvidia -u 1000:1000 -v $(pwd)/../:/work/waha-tuhi -w /work/waha-tuhi/train 473856431958.dkr.ecr.ap-southeast-2.amazonaws.com/waha-tuhi/train-gpu:latest python3 -u DeepSpeech/DeepSpeech.py \
--scorer_path '../data/lm/base_encoder/lm.scorer' \
--alphabet_config_path '../data/lm/base_encoder/alphabet.txt' \
--checkpoint_dir /work/waha-tuhi/models/20200603_ds.0.7.1_thm/checkpoints \
--summary_dir /work/waha-tuhi/models/20200603_ds.0.7.1_thm/summaries \
--one_shot_infer 'test.wav'
I Loading best validating checkpoint from /work/waha-tuhi/models/20200603_ds.0.7.1_thm/checkpoints/best_dev-29322
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Fatal Python error: Segmentation fault
Current thread 0x00007f8b75afc740 (most recent call first):
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/swigwrapper.py", line 361 in ctc_beam_search_decoder
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 83 in ctc_beam_search_decoder
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 882 in do_single_file_inference
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 934 in main
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250 in _run_main
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299 in run
File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 939 in run_script
File "DeepSpeech/DeepSpeech.py", line 12 in <module>
makefile:252: recipe for target 'test_infer' failed
make: *** [test_infer] Error 139
So… yeah. All the same files are used as input, except the lm.scorer, so clearly I messed up the way I generated lm.scorer.
But how can I validate/debug the way I created lm.scorer?
The way that I generated the lm.scorer was a two-step process, specifically:
Step 1: generate_lm.py
(Note: I added a 'skip_symbols' flag which just passes through to lmplz; the change is sketched below.)
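This is roughly it, paraphrased from memory rather than the exact diff ('lmplz_args' here stands for whatever the script calls the list of lmplz arguments it builds):

# sketch of my generate_lm.py change, not the exact diff
parser.add_argument(
    "--skip_symbols",
    action="store_true",
    help="Pass lmplz's --skip_symbols flag so <s>, </s> and <unk> in the "
         "input text are treated as whitespace instead of raising an error.",
)
# ... later, where the lmplz command line is assembled:
if args.skip_symbols:
    lmplz_args.append("--skip_symbols")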
docker run --rm -u root:root -v $(pwd)/../:/work/waha-tuhi -w /work/waha-tuhi/lm 473856431958.dkr.ecr.ap-southeast-2.amazonaws.com/waha-tuhi/lm:latest bash -c "\
python ../train/DeepSpeech/data/lm/generate_lm.py \
--input_txt ../data/lm_data/base_encoder/mi.languagemodel.train \
--output_dir ../data/lm/base_encoder \
--top_k 500000 --kenlm_bins /usr/lib/kenlm/build/bin \
--arpa_order 5 --max_arpa_memory 85% \
--arpa_prune \"0|0|1\" \
--binary_a_bits 22 \
--binary_q_bits 8 \
--binary_type trie \
--skip_symbols"
(progress bar output trimmed: the line counter reached 172986 after about 4 seconds)
=== 1/5 Counting and sorting n-grams ===
Reading /work/waha-tuhi/data/lm/base_encoder/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Warning: <s> appears in the input. All instances of <s>, </s>, and <unk> will be interpreted as whitespace.
****************************************************************************************************
Unigram tokens 2978778 types 29560
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:354720 2:5339543040 3:10011643904 4:16018628608 5:23360501760
Statistics:
1 29560 D1=0.639748 D2=1.02063 D3+=1.43701
2 307654 D1=0.700202 D2=1.07513 D3+=1.42814
3 256739/1010772 D1=0.795858 D2=1.13648 D3+=1.42481
4 243294/1706845 D1=0.875455 D2=1.22982 D3+=1.43663
5 158645/2060538 D1=0.908299 D2=1.26413 D3+=1.37111
Memory estimate for binary LM:
type kB
probing 22469 assuming -p 1.5
probing 27317 assuming -r models -p 1.5
trie 11048 without quantization
trie 5976 assuming -q 8 -b 8 quantization
trie 9927 assuming -a 22 array pointer compression
trie 4855 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:354720 2:4922464 3:5134780 4:5839056 5:4442060
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:354720 2:4922464 3:5134780 4:5839056 5:4442060
=== 5/5 Writing ARPA model ===
Name:lmplz VmPeak:53646748 kB VmRSS:22456 kB RSSMax:9517180 kB user:4.444 sys:5.044 CPU:9.49127 real:8.12519
Reading ../data/lm/base_encoder/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Reading ../data/lm/base_encoder/lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
Converting to lowercase and counting word occurrences ...
Saving top 500000 words ...
Calculating word statistics ...
Your text file has 3324752 words in total
It has 29559 unique words
Your top-500000 words are 100.0000 percent of all words
Your most common word "te" occurred 292223 times
The least common word in your top-k is "kātoa" with 1 times
The first word with 2 occurrences is "uke" at place 16509
Creating ARPA file ...
Filtering ARPA file using vocabulary of top-k words ...
Building lm.binary ...
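(One sanity check I can run at this point is to query the intermediate lm.binary directly with the kenlm Python bindings; a rough sketch, assuming the kenlm module is importable inside the lm image:)

import kenlm

# load the quantized trie binary produced by build_binary
model = kenlm.Model("../data/lm/base_encoder/lm.binary")
print(model.order)  # should report 5 for a 5-gram model

# score a short sentence in the target language; an exception here
# would point at a broken binary rather than at the packaging step
print(model.score("ka ohia kaoha te hai me na papoa o taou", bos=True, eos=True))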
The second step is making the lm.scorer.
Step 2: Make lm.scorer
docker run --rm -u root:root -v $(pwd)/../:/work/waha-tuhi -w /work/waha-tuhi/lm 473856431958.dkr.ecr.ap-southeast-2.amazonaws.com/waha-tuhi/lm:latest bash -c "\
python ../train/DeepSpeech/data/lm/generate_package.py \
--alphabet ../data/lm/base_encoder/alphabet.txt \
--lm ../data/lm/base_encoder/lm.binary \
--vocab ../data/lm/base_encoder/mi_vocab.txt \
--package ../data/lm/base_encoder/lm.scorer \
--default_alpha 1.47 \
--default_beta 3.49 \
--force_utf8=true"
16508 unique words read from vocabulary file.
Doesn't look like a character based model.
Package created in ../data/lm/base_encoder/lm.scorer
Clearly the lm.scorer is no dang good, but:
- can anyone see what I might have done wrong?
- is there a way to validate or debug the process of creating it?
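For what it's worth, the only standalone check I could think of is trying to load the scorer directly with ds_ctcdecoder inside the training image, something like the rough sketch below (I'm guessing the Scorer/Alphabet constructor arguments from the 0.7.x source, so they may need adjusting), but maybe there's a better way?

# rough sketch: try to load the custom scorer outside of DeepSpeech.py
from deepspeech_training.util.text import Alphabet
from ds_ctcdecoder import Scorer

alphabet = Alphabet("../data/lm/base_encoder/alphabet.txt")
# alpha/beta are just the defaults I packaged with
scorer = Scorer(1.47, 3.49, "../data/lm/base_encoder/lm.scorer", alphabet)
print("scorer loaded without crashing")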