DeepSpeech Docker image: language model lmplz segmentation fault

Hi,

I’m trying to build my own LM by following the instructions at this link for DeepSpeech 0.9.3:

https://mozilla.github.io/deepspeech-playbook/SCORER.html#using–lm-optimizerpy–to-generate-values-for-the-parameters----default-alpha–and----default-beta–that-are-used-by-the–generate-scorer-package–script

The environment I’m using is as described here:
https://mozilla.github.io/deepspeech-playbook/ENVIRONMENT.html

The docker image runs fine. The problem is when I try to generate the lm.binary and vocab-500000.txt files.

Running the following command causes a segmentation fault.

python3 generate_lm.py \
 --input_txt /<Location_to_my_sentences> \
 --output_dir /DeepSpeech/deepspeech-data/ \
 --top_k 500000 --kenlm_bins /DeepSpeech/native_client/kenlm/build/bin/ \
 --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
 --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
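One way to narrow down which step crashes would be to call the KenLM binary directly and look at its exit status. The following is only a sketch: it assumes lmplz sits in the --kenlm_bins directory from the command above and takes its usual --order, --memory, --text, --arpa and --prune options; corpus.txt, lm.arpa and the helper names are hypothetical placeholders, not part of DeepSpeech.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description.

    On POSIX, a negative return code -N means the child was killed by signal N.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    return describe_exit(subprocess.run(cmd).returncode)


if __name__ == "__main__":
    # Hypothetical paths: adjust to the real corpus and output location.
    bins = "/DeepSpeech/native_client/kenlm/build/bin/"
    cmd = [
        bins + "lmplz",
        "--order", "5",
        "--memory", "85%",
        "--text", "corpus.txt",
        "--arpa", "lm.arpa",
        "--prune", "0", "0", "1",
    ]
    print(run_and_describe(cmd))
```

If this reports SIGSEGV, the crash is in lmplz itself rather than in the Python wrapper script, and varying --memory (for example an absolute size instead of a percentage) may show whether the memory setting matters.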

I have tried compiling a new kenlm binary inside the container, but it results in the same error. Another solution I found was to upgrade Boost to version 1.67, but this did not fix the issue either.

Has anyone tried the Docker image and run into the same problem?

One extra bit of information: I’m running Docker under Ubuntu in WSL2 on Windows.

I did some experiments, and kenlm seems to work fine under an Ubuntu VM, just not in WSL2.

Does anyone know why?

The memory allocated to the Docker container is 8GB. I’ve also tried a small test app that mallocs 1GB of memory and fills it with random data. This works fine in the Docker image under WSL2. kenlm by default seems to use about 1GB too, but it does not work in WSL2.
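For anyone who wants to repeat that memory check, a test along those lines could look roughly like the Python sketch below. It is an approximation, not the original test app (which used malloc): the function name and the 1 MiB chunk size are illustrative assumptions.

```python
import os


def fill_with_random(total_bytes, chunk_bytes=1 << 20):
    """Allocate total_bytes in one buffer and overwrite it with random data."""
    buf = bytearray(total_bytes)   # one contiguous allocation, zero-filled
    view = memoryview(buf)         # lets us write slices without copying buf
    for start in range(0, total_bytes, chunk_bytes):
        end = min(start + chunk_bytes, total_bytes)
        # Touch every page by writing random bytes into this chunk.
        view[start:end] = os.urandom(end - start)
    return buf


if __name__ == "__main__":
    buf = fill_with_random(1 << 30)  # 1 GiB, matching the test described above
    print(f"allocated and filled {len(buf)} bytes")
```

Writing to every chunk matters here, because memory that is allocated but never touched may not actually be backed by physical pages.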