Getting subprocess call error while building scorer

I am trying to build a customised scorer (language model) for speech-to-text using DeepSpeech in Colab. While calling generate_lm.py I get this error:

Traceback (most recent call last):
  File "generate_lm.py", line 210, in <module>
    main()
  File "generate_lm.py", line 201, in main
    build_lm(args, data_lower, vocab_str)
  File "generate_lm.py", line 126, in build_lm
    binary_path,
  File "/usr/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/content/DeepSpeech/native_client/kenlm/build/bin/build_binary', '-a', '255', '-q', '8', '-v', 'trie', '/content/DeepSpeech/data/lm/lm_filtered.arpa', '/content/DeepSpeech/data/lm/lm.binary']' died with <Signals.SIGSEGV: 11>.

I am calling the script generate_lm.py like this:

! python3 generate_lm.py --input_txt hindi_tokens.txt --output_dir /content/DeepSpeech/data/lm --top_k 500000 --kenlm_bins /content/DeepSpeech/native_client/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
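In case it helps anyone debugging this, the exact build_binary command from the traceback can be re-run on its own to check whether the segfault comes from KenLM itself rather than from the Python wrapper. A minimal sketch, using the paths verbatim from the error above:

import subprocess

# Re-run the exact build_binary command from the traceback to confirm
# the crash happens inside KenLM, not in generate_lm.py itself.
cmd = [
    "/content/DeepSpeech/native_client/kenlm/build/bin/build_binary",
    "-a", "255", "-q", "8", "-v", "trie",
    "/content/DeepSpeech/data/lm/lm_filtered.arpa",
    "/content/DeepSpeech/data/lm/lm.binary",
]
result = subprocess.run(cmd)
# A negative return code is the signal number; -11 corresponds to SIGSEGV.
print("return code:", result.returncode)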

I am having the exact same problem and still haven't figured out the solution. The worst part is that I had it working and was building different scorers just over a month ago. I had to clean my environment, and although I think I'm following the same steps setting it up again, it's now giving me this error.

I actually think I figured it out, or at least found a workaround: try reducing top_k. I was making a smaller scorer (it only has just over 3,000 phrases, and I had to use --discount_fallback), but when I dropped top_k to 965 (lots of trial and error to find that number) it worked.
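If you want to skip the manual trial and error, a loop like the rough sketch below can probe for a working value: it re-runs generate_lm.py with progressively smaller top_k values until the build succeeds. The flags are copied from the invocation earlier in the thread; the halving step and the floor are arbitrary choices of mine, not anything DeepSpeech prescribes:

import subprocess

def find_working_top_k(start=500000, floor=100):
    # Try progressively smaller top_k values (halving each time)
    # until generate_lm.py exits cleanly.
    top_k = start
    while top_k >= floor:
        cmd = [
            "python3", "generate_lm.py",
            "--input_txt", "hindi_tokens.txt",
            "--output_dir", "/content/DeepSpeech/data/lm",
            "--top_k", str(top_k),
            "--kenlm_bins", "/content/DeepSpeech/native_client/kenlm/build/bin/",
            "--arpa_order", "5",
            "--max_arpa_memory", "85%",
            "--arpa_prune", "0|0|1",
            "--binary_a_bits", "255",
            "--binary_q_bits", "8",
            "--binary_type", "trie",
        ]
        if subprocess.run(cmd).returncode == 0:
            return top_k  # this value built successfully
        top_k //= 2
    return None  # nothing worked down to the floor

print(find_working_top_k())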

Thank you! By reducing top_k to 15,000 I am able to execute the code. I have 42,594 phrases. But is there any way to find a suitable value for top_k without checking manually? I also couldn't find any details about the top_k parameter. Please share if you know more about it. Thanks for replying.

You are most welcome; I'm glad it worked for you as well. I also struggled to find a good value for top_k. Looking at the values that worked for us, I'm guessing it might be around one third of the total phrases, but that could be a coincidence, and it could have to do with how many unique words there are. From what I could figure out, top_k is how many of the most common words you would like used in the scorer, although I am unsure how exactly the scorer uses that information.
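If top_k really is a cap on the most common words, one way to get a data-driven starting point instead of guessing is to count how many unique words the corpus actually contains, since any value above that cannot select anything extra. A small sketch, assuming the same hindi_tokens.txt input file used earlier in the thread:

from collections import Counter

# Count word frequencies in the corpus; top_k would select
# the most frequent of these unique words.
counts = Counter()
with open("hindi_tokens.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

print("unique words:", len(counts))
print("10 most common:", counts.most_common(10))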