Generating higher order scorer

I am trying to generate a scorer with order 8. Here are the steps I followed:

  • Compiled KenLM on my system to support a maximum order of 10.
  • Ran the generate_lm.py script to create lm.binary. The script ran without any errors. Exact command used:
python3 generate_lm.py --input_txt corpus.txt --output_dir . --top_k 1000000 --kenlm_bins /path/to/kenlm/build/bin/ --arpa_order 8 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
  • Ran the generate_scorer_package binary to create the scorer. Exact command used:
./generate_scorer_package --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-1000000.txt --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

However, when running generate_scorer_package, I get the following error:

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/model.cc:49 in void lm::ngram::detail::{anonymous}::CheckCounts(const std::vector<long unsigned int>&) threw FormatLoadException because `counts.size() > 6'.
This model has order 8 but KenLM was compiled to support up to 6.  If your build system supports changing KENLM_MAX_ORDER, change it there and recompile.  With cmake:
 cmake -DKENLM_MAX_ORDER=10 ..
With Moses:
 bjam --max-kenlm-order=10 -a
Otherwise, edit lm/max_order.hh.

I have already compiled KenLM to support orders up to 10, so why is it still throwing this error?

P.S. This is DeepSpeech v0.8.2. When I generate a scorer with order 5, the same scripts work without any issues.

How did you do this?

  1. Downloaded KenLM
    wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
  2. Opened file kenlm/lm/CMakeLists.txt and changed line 37 to
    set(KENLM_MAX_ORDER 10 CACHE STRING "Maximum supported ngram order")
  3. Opened file kenlm/util/have.hh and added these lines
#ifndef KENLM_MAX_ORDER
#define KENLM_MAX_ORDER = 10
#endif
  4. Compiled KenLM
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
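
The hint in the error message suggests passing the order to CMake on the command line instead of editing files. A minimal sketch of that for the standalone KenLM checkout, assuming the directory layout from the steps above; KENLM_DIR, MAX_ORDER, and build_kenlm are illustrative names, not part of KenLM. It only affects the binaries produced in kenlm/build.

```shell
#!/bin/sh
# Sketch only: configure a standalone KenLM build with a higher maximum order.
# KENLM_DIR and MAX_ORDER are hypothetical helper variables for this example.
KENLM_DIR="${KENLM_DIR:-kenlm}"
MAX_ORDER="${MAX_ORDER:-10}"

build_kenlm() {
    # Run in a subshell so the caller's working directory is left unchanged.
    (
        mkdir -p "$KENLM_DIR/build" &&
        cd "$KENLM_DIR/build" &&
        # Flag taken from the hint printed in the error message.
        cmake -DKENLM_MAX_ORDER="$MAX_ORDER" .. &&
        make -j2
    )
}

# The function is only defined here; call build_kenlm to start the build.
```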

So, this is wrong. As you can deduce from the error, the exception is thrown from native_client/kenlm/lm/model.cc.

This is code that is included in our tree and built as part of generate_scorer_package, so your change to the standalone KenLM has no impact. Running git grep KENLM_MAX_ORDER would point you to the value to change at https://github.com/mozilla/DeepSpeech/blob/bcfc74874f5a3078cd2f731c625430db1394dd0c/native_client/BUILD#L79; after changing it, rebuild generate_scorer_package.
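
The two checks described here can be sketched as small shell helpers: one pulls the source path out of the exception text, the other lists where KENLM_MAX_ORDER is mentioned in a tree. The function names error_source and find_max_order are made up for this example, and the parsing assumes the "what():  <file>:<line> in ..." layout seen in the error above.

```shell
#!/bin/sh
# Print the source file named in a KenLM exception message read from stdin.
# Assumes the "what():  <file>:<line> in ..." layout from the error above.
error_source() {
    sed -n 's/^ *what(): *\([^ :]*\):[0-9][0-9]* in .*/\1/p'
}

# List every file and line under a directory that mentions KENLM_MAX_ORDER.
# Inside a git checkout, `git grep -n KENLM_MAX_ORDER` gives the same hint.
find_max_order() {
    grep -rn "KENLM_MAX_ORDER" "$1"
}

# Example usage (paths are placeholders for your own setup):
#   ./generate_scorer_package ... 2>&1 | error_source
#   find_max_order /path/to/DeepSpeech/native_client
```

A path that starts with native_client/ indicates the exception comes from the KenLM copy inside the DeepSpeech tree, not from the separately downloaded one.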

Oh, alright. Thanks a lot.