Error when creating my own scorer file

Hi dear friends,
I am on the latest 0.7.3 version, training a Chinese model. When I use the two commands below to generate my own scorer file, I get an error message like the one below:
python3 generate_lm.py --input_txt vocabulary.txt --output_dir . --top_k 500000 --kenlm_bins /home/parallels/Desktop/ASR/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

(deepspeech-0.7-train) parallels@parallels-Parallels-Virtual-Platform:~/Desktop/ASR/mozilla/DeepSpeech-0.7/data/lm$ python3 generate_package.py --alphabet alphabet.txt --lm lm.binary --vocab vocab-500000.txt --package chinese.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284
4173 unique words read from vocabulary file.
Looks like a character based model.
Using detected UTF-8 mode: True
Error when creating chinese.scorer
swig/python detected a memory leak of type 'Alphabet *', no destructor found.

But I see that the chinese.scorer file is there, so I want to know what this error is. Is it harmless? What should I do if I want to avoid it?
Thanks.

Isn’t it the latest master?

Can you confirm what git commit you are on? We just changed save_dictionary to return true / false, so that would mean there was an error when writing it?

It is here, but is it valid?
Can you make sure the console output you shared is sound and complete and it is not missing any error message?

Is it possible you are not actually using the latest ds_ctcdecoder build but the 0.7.3 one, which would not return the value? In that case the error message you get is expected but harmless, because in fact there was no error.
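You can check which decoder build is actually installed with something like this (assuming a pip-managed virtualenv; ds_ctcdecoder exposes a __version__ attribute):

pip3 show ds_ctcdecoder
python3 -c "import ds_ctcdecoder; print(ds_ctcdecoder.__version__)"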

Yes, it is; I just did a git pull today.

I am at:
commit 2514b6793303fb3fab493b8a715c986682216a2c (HEAD -> master, origin/master, origin/HEAD)
Merge: e6135bbb 6c2cbbd7
Author: lissyx <1645737+lissyx@users.noreply.github.com>
Date: Wed Jun 17 10:09:15 2020 +0200

Merge pull request #3066 from lissyx/output-stream-error

Fix #3053: Check output stream when producing scorer

Yes, it is there, but I am not sure if it is valid. I can use it to train, though, and training with the command below does not report errors:
python -u DeepSpeech.py \
--train_files /home/parallels/Desktop/ASR/mozilla/mozilla_common_voice_zh/zh-CN/train.csv \
--dev_files /home/parallels/Desktop/ASR/mozilla/mozilla_common_voice_zh/zh-CN/validated.csv \
--test_files /home/parallels/Desktop/ASR/mozilla/mozilla_common_voice_zh/zh-CN/test.csv \
--train_batch_size 80 \
--dev_batch_size 80 \
--test_batch_size 40 \
--n_hidden 1024 \
--epochs 100 \
--dropout_rate 0.22 \
--learning_rate 0.0001 \
--report_count 100 \
--export_dir /home/parallels/Desktop/ASR/mozilla/mozilla_common_voice_zh/zh-CN/results/model_export/ \
--alphabet_config_path /home/parallels/Desktop/ASR/mozilla/mozilla_common_voice_zh/zh-CN/alphabet.txt \
--checkpoint_dir /home/parallels/Desktop/ASR/mozilla/mozilla_common_voice_zh/zh-CN/results/checkout/ \
--scorer_path /home/parallels/Desktop/ASR/mozilla/DeepSpeech-0.7/data/lm/kenlm.scorer \
--early_stop False \
--utf8 True \
"$@"

Yes, I am sure that’s all output.

I am not sure; I just use:
pip3 install --upgrade --force-reinstall -e .
to install all the dependencies.

Thanks.

If you have an error, you are at that sha1, and the scorer file works, it’s likely to be fallout from my changes. This should fix it: https://github.com/mozilla/DeepSpeech/pull/3078

save_dic = scorer.save_dictionary(package_path, True)
if save_dic is None or save_dic:

I used the above two lines of code, and it works fine now.
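For reference, the surrounding check in generate_package.py then reads roughly like this (a sketch: with an older ds_ctcdecoder, save_dictionary has no return value, so None is treated as success):

save_dic = scorer.save_dictionary(package_path, True)
if save_dic is None or save_dic:
    print("Package created in {}".format(package_path))
else:
    print("Error when creating {}".format(package_path))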

BTW, this error shows up many times:
swig/python detected a memory leak of type 'Alphabet *', no destructor found.

What is it? Why does it happen? Am I missing something?
Thanks.

Those errors are already reported on Github (and fixed).

Thanks, that confirms the issue. We’re going to make a 0.7.4 instead; that should fix this and the leaking reported.

Oh, how can I get the fix? I just pulled the code, but the error still shows up.
Thanks.

As I said: 0.7.4 will get released. You need to rebuild ds_ctcdecoder from sources in the meantime if you want it.
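If you want to try that before the release, the decoder can be rebuilt from a DeepSpeech checkout roughly like this (the recipe from the decoder docs; adjust NUM_PROCESSES to your machine):

cd native_client/ctcdecode
make bindings NUM_PROCESSES=8
pip3 install --upgrade dist/*.whl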

Hi lissyx,
I tried your latest version 0.7.4, but I got the error below:
(ds-train-0.7.4) (base) chenyuz@chenyuz-y7000p:~/Desktop/ASR/mozilla/DeepSpeech/data/lm$ python3 generate_lm.py --input_txt vocabulary.txt --output_dir . --top_k 500000 --kenlm_bins /home/chenyuz/Desktop/ASR/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

Converting to lowercase and counting word occurrences ...
| | # | 28874 Elapsed Time: 0:00:00

Saving top 500000 words ...

Calculating word statistics ...
Your text file has 489984 words in total
It has 4173 unique words
Your top-500000 words are 100.0000 percent of all words
Your most common word "的" occurred 16667 times
The least common word in your top-k is "裹" with 1 times
The first word with 2 occurrences is "泱" at place 3954

Creating ARPA file …
=== 1/5 Counting and sorting n-grams ===
Reading /home/chenyuz/Desktop/ASR/mozilla/DeepSpeech/data/lm/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Traceback (most recent call last):
  File "generate_lm.py", line 210, in <module>
    main()
  File "generate_lm.py", line 201, in main
    build_lm(args, data_lower, vocab_str)
  File "generate_lm.py", line 97, in build_lm
    subprocess.check_call(subargs)
  File "/home/chenyuz/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/chenyuz/Desktop/ASR/kenlm/build/bin/lmplz', '--order', '5', '--temp_prefix', '.', '--memory', '85%', '--text', './lower.txt.gz', '--arpa', './lm.arpa', '--prune', '0', '0', '1']' died with <Signals.SIGSEGV: 11>.
(ds-train-0.7.4) (base) chenyuz@chenyuz-y7000p:~/Desktop/ASR/mozilla/DeepSpeech/data/lm$

Please reproduce outside of the Python call and share a gdb stack; otherwise there’s nothing we can do.
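Something along these lines, reusing the exact lmplz invocation from your traceback:

gdb --args /home/chenyuz/Desktop/ASR/kenlm/build/bin/lmplz --order 5 --temp_prefix . --memory 85% --text ./lower.txt.gz --arpa ./lm.arpa --prune 0 0 1
(gdb) run
(gdb) bt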

Please also make sure you are not mixing things: https://github.com/mozilla/DeepSpeech/issues/2875 ("Segmentation fault using utf8 scorer with non utf8 data")