Problems creating Trie file

Plato · March 20, 2020, 7:07pm

specs:
OS: Ubuntu 18.04
DeepSpeech: 0.6.1
Local clone of the git: HEAD detached at v0.6.1
Have I written custom code: No, except that I changed the URL in data/lm/generate_lm.py and changed the 500k words to 600k.

I can build my lm.binary without a problem; I also followed the build steps that I found here.
When I try to generate my trie like this:

…/tensorflow/bazel-bin/native_client/generate_trie alphabet.txt lm.binary trie

it gives me the following error:

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/model.cc:70 in lm::ngram::detail::GenericModel<Search, VocabularyT>::GenericModel(const char*, const lm::ngram::Config&) [with Search = lm::ngram::trie::TrieSearch<lm::ngram::SeparatelyQuantize, lm::ngram::trie::ArrayBhiksha>; VocabularyT = lm::ngram::SortedVocabulary] threw FormatLoadException because `new_config.enumerate_vocab && !parameters.fixed.has_vocabulary'.
The decoder requested all the vocabulary strings, but this binary file does not have them.  You may need to rebuild the binary file with an updated version of build_binary.
Aborted (core dumped)

I’m not sure what caused this problem (I’m assuming it is not a problem with the DeepSpeech code), but could someone help me?
My understanding is that from 0.7.0 onward the trie and lm will be replaced by a scorer file, but since I’m working on the head of 0.6.1, I assume this is also not the problem. Maybe I cloned the wrong version of kenlm from GitHub?

PS: running /PATH/native_client/generate_trie without arguments gives me the following output:
Usage: ../tensorflow/bazel-bin/native_client/generate_trie <alphabet> <lm_model> <trie_path>

lissyx · March 20, 2020, 7:08pm

Why do you rebuild generate_trie when we provide it in native_client.tar.xz? This adds one layer of uncertainty.

Could you ls -hal all the vocab / lm files? I suspect it was improperly generated.

Plato · March 24, 2020, 1:08pm

Why do you rebuild generate_trie when we provide it in native_client.tar.xz?

Not sure; it was the first approach I found on the internet/forums.

Could you ls -hal all the vocab / lm files? I suspect it was improperly generated.

This was indeed the problem: lm.binary was improperly generated. I had to remove the -v parameter from the ‘build_binary’ call in the generate_lm.py file. Thank you!
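
For anyone landing here with the same error, the fix above can be sketched roughly as follows. As far as I know, KenLM's build_binary treats -v as "leave the vocabulary strings out of the binary", which is exactly what generate_trie complains about. This is only a minimal sketch, not the actual code from generate_lm.py: the helper name `build_binary_command`, the file names, and the -a 255 / -q 8 / trie values are assumptions, so check them against your own copy of the script.

```python
def build_binary_command(arpa_path, binary_path, include_vocab=True):
    """Return the argument list for KenLM's build_binary (hypothetical helper).

    The -a/-q/trie values are assumed, not copied from generate_lm.py.
    """
    cmd = ["build_binary", "-a", "255", "-q", "8"]
    if not include_vocab:
        # -v omits the vocabulary strings from the binary; generate_trie
        # needs those strings, so 0.6.1 users should leave this flag out.
        cmd.append("-v")
    # Options go first, then the data structure type and the two paths.
    cmd += ["trie", arpa_path, binary_path]
    return cmd


# Hypothetical file names; prints the command line without running it.
print(" ".join(build_binary_command("lm_filtered.arpa", "lm.binary")))
```

Passing the returned list to subprocess.check_call would run it, provided build_binary is on your PATH. After rebuilding lm.binary this way, generate_trie should be able to read the vocabulary from it.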

lissyx · March 24, 2020, 12:02pm

From the release page? From the documented util/taskcluster.py?

Gang_He · March 24, 2020, 12:56pm

python util/taskcluster.py --target native_client

I downloaded native_client this way and used generate_trie to generate the trie file. However, it throws this error:

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/model.cc:70 in lm::ngram::detail::GenericModel<Search, VocabularyT>::GenericModel(const char*, const lm::ngram::Config&) [with Search = lm::ngram::trie::TrieSearch<lm::ngram::SeparatelyQuantize, lm::ngram::trie::ArrayBhiksha>; VocabularyT = lm::ngram::SortedVocabulary] threw FormatLoadException because `new_config.enumerate_vocab && !parameters.fixed.has_vocabulary'.
The decoder requested all the vocabulary strings, but this binary file does not have them.  You may need to rebuild the binary file with an updated version of build_binary.
Aborted (core dumped)

How can I fix this?

lissyx · March 24, 2020, 1:07pm

Have you properly created the LM files? Can you reproduce with the released LM files? We need more context …

It’s also very irritating for everyone that you hijack an existing thread to re-ask the same question as the poster of the original thread.

If you had paid attention to the thread you are replying to, you would have found the solution.

Gang_He · March 24, 2020, 2:31pm

Thank you! I am working on Mandarin ASR. I have created the lm.arpa and lm.binary files using generate_lm.py with the kenlm tools. I found that the kenlm tools I am using are different from the kenlm in native_client/kenlm. The ones I am using were installed from GitHub with the command “pip install https://github.com/kpu/kenlm/archive/master.zip”. I also found that, in the source code of DeepSpeech v0.6.1, kenlm cannot be compiled correctly with:
mkdir -p build
cd build
cmake …
make -j 4

lissyx · March 24, 2020, 2:32pm

You should not use those sources, but build from upstream, as we document.

Gang_He · March 24, 2020, 2:45pm

Do you mean building the whole DeepSpeech codebase with bazel?

derek · March 27, 2020, 3:29am

This is the correct answer. Thank you.