Generate_lm.py subprocess.CalledProcessError

Hi everyone! First of all, happy new year!
I've got a new problem with generate_lm.py :frowning: let's say it's the first one of this year!

I tried to follow the docs to validate my virtual environment, but I got this:
(doc: https://deepspeech.readthedocs.io/en/latest/Scorer.html )

(venv) nathan@nathan-G771JM:~/PycharmProjects/DeepSpeech/DeepSpeech/data/lm$ python3 generate_lm.py --input_txt librispeech-lm-norm.txt.gz --output_dir . --top_k 500000 --kenlm_bins …/…/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

Converting to lowercase and counting word occurrences …
| | # | 40418260 Elapsed Time: 0:15:38

Saving top 500000 words …

Calculating word statistics …
Your text file has 803288729 words in total
It has 973673 unique words
Your top-500000 words are 99.9354 percent of all words
Your most common word "the" occurred 49059384 times
The least common word in your top-k is "corders" with 2 times
The first word with 3 occurrences is “zungwan” at place 420186

Creating ARPA file …
=== 1/5 Counting and sorting n-grams ===
Reading /home/nathan/PycharmProjects/DeepSpeech/DeepSpeech/data/lm/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
******************************Traceback (most recent call last):
  File "generate_lm.py", line 210, in <module>
    main()
  File "generate_lm.py", line 201, in main
    build_lm(args, data_lower, vocab_str)
  File "generate_lm.py", line 97, in build_lm
    subprocess.check_call(subargs)
  File "/usr/local/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['…/…/kenlm/build/bin/lmplz', '--order', '5', '--temp_prefix', '.', '--memory', '85%', '--text', './lower.txt.gz', '--arpa', './lm.arpa', '--prune', '0', '0', '1']' died with <Signals.SIGKILL: 9>.

I’m running this with :
Ubuntu 20.04.1 LTS (64 bits)
Intel® Core™ i5-4200H CPU @ 2.80GHz × 4
NV117 / Intel® HD Graphics 4600 (HSW GT2)
Python 3.6

I don't know if it's a kenlm or a generate_lm.py problem; any help would be appreciated :slight_smile: !

Please format output for future posts.

This usually indicates that you or something else sent a kill -9 signal to the process. How much material are you using as input, and could you be hitting some limit on the machine?
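
If the kernel's OOM killer is the source, it leaves a trace in the kernel log; a quick check like this (plain Linux, nothing DeepSpeech-specific, may need sudo) should show it:

# check whether the OOM killer terminated lmplz
dmesg | grep -i -E "out of memory|killed process"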

Try a small amount of text (e.g. 100 MB) to check that everything is working, then use more material. If the kill comes from somewhere else, try setting the memory to around 70% for generation.
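
For example, just a sketch of your command from above with the memory cap lowered; adjust --kenlm_bins to wherever your kenlm binaries actually live:

python3 generate_lm.py --input_txt librispeech-lm-norm.txt.gz --output_dir . --top_k 500000 \
  --kenlm_bins path/to/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "70%" \
  --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie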

Thank you for your answer. I'm trying to reproduce the external scorer, so the input is about 4.3 GB.
I made a new test this morning, and I guess it's a memory problem: it stops when memory and swap are 100% used.

I guess my computer's memory is struggling; I will try with a small amount of text.
I need to understand how to make one first :grin:


Please read our request on how to post, no images.


As said above, you should first make sure that everything is working, then try lower memory values. Check the output, kenlm is quite good at reporting what's wrong.
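
To get a small test input, something along these lines should work (GNU coreutils; cutting at roughly 100 MB will truncate the last line, which doesn't matter for a smoke test):

# take roughly the first 100 MB of the corpus and re-compress it
zcat librispeech-lm-norm.txt.gz | head -c 100M | gzip > librispeech-sample.txt.gz

Then point --input_txt at librispeech-sample.txt.gz and run the same generate_lm.py command as before.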

Ok, so I did this: I took the first 26 lines of the librispeech corpus:

A
A A
A A A
A A A A
A A A A A
A A A A A A A A A A A A A A
A A A A A AH
A A A A A AH THE CRY WAS WRUNG FROM JOHNNIE
A A A A A BOVE SECOND SINGER DIMINUENDO
A A A A A MEN
A A A A A Y
A A A A AHOWOOH
A A A A ALL ABOARD
A A A A ARE FOUR PIECES OF WIRE OF THE SAME THICKNESS AS USED FOR THE PRECEDING NET
A A A A CITY IN SOUTH AMERICA
A A A A H
A A A A L L S WELL
A A A A OBSERVED M’TELA INTERESTEDLY
A A A A ONE OF THE UNITED STATES
A A A A RIVER IN SOUTH AMERICA
A A A A Y
A A A AH
A A A AH A A A AH
A A A AN ISTHMUS
A A A AS IN FA THER
A A A AS IN MARE

First I got this:

{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 1-gram discount out of range for adjusted count 2: -3.4893618. This means modified Kneser-Ney smoothing thinks something is weird about your data. To override this error for e.g. a class-based model, rerun with --discount_fallback

Maybe because the data set is so tiny. By rerunning with --discount_fallback I got a SUCCESS.
I guess the problem is that the full input is too much for my machine.
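
For the record, the successful rerun was roughly the same command as before with --discount_fallback added on the end (the input file name here just stands in for my 26-line sample, and the kenlm path is shortened):

python3 generate_lm.py --input_txt librispeech-sample.txt.gz --output_dir . --top_k 500000 \
  --kenlm_bins path/to/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie \
  --discount_fallback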

Yep, the model is meant to be computed over many thousands of words; if it gets just a couple, it gives this warning.

Generation works, try lower memory values.
