Language Model Creation

I’m trying to create a trie and am encountering the following error:

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)

I am following the tutorial here, which was used to make a French model. I was able to make the binary with the following KenLM commands:

/lmplz --text vocabulary.txt --arpa words.arpa --o 3
/build_binary -T -s words.arpa language.binary

I’m attempting to make the trie as follows:

native_client/generate_trie alphabet.txt language.binary vocabulary.txt trie

According to this GitHub issue, this may be caused by the switch in the language model tooling. I changed my binary-generation command by adding "trie" for [type], as recommended in the issue.

/build_binary -T -s trie words.arpa language.binary
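For reference, the two KenLM steps with the explicit trie type can be sketched from Python like this. The `bin/` prefix and file names are assumptions about the KenLM checkout layout, and `build_binary` is assumed to default to the probing hash-table format when no type is given, which would explain the original error:

```python
import subprocess

def kenlm_commands(text="vocabulary.txt", arpa="words.arpa",
                   binary="language.binary", order=3):
    """Return the two KenLM invocations as argv lists.

    Passing the explicit "trie" type matters: without it,
    build_binary produces a probing hash table, which is the
    mismatch the FormatLoadException complains about.
    """
    lmplz = ["bin/lmplz", "-o", str(order),
             "--text", text, "--arpa", arpa]
    build_binary = ["bin/build_binary", "trie", arpa, binary]
    return lmplz, build_binary

# To actually run them (requires a built KenLM in bin/):
# for cmd in kenlm_commands():
#     subprocess.run(cmd, check=True)
```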

That didn’t seem to make any change. I am using the native client master downloaded from TaskCluster (from https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.master.cpu/artifacts/public/native_client.tar.xz).
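If rebuilding with the trie type really made no difference, one thing worth ruling out is that the decoder is still reading a stale copy of the old probing-format file. A minimal checksum-comparison sketch (the paths in the comment are placeholders):

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large LM binaries are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# If the digest is unchanged after re-running build_binary with the
# "trie" type, the decoder is still loading the old file, e.g.:
# print(file_digest("language.binary"))
```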

I think this is a simple flag issue in creating the binary but I’m not sure what to change.

I should also say I’m using a KenLM build that is separate from my DeepSpeech installation. It seems the KenLM bundled with DeepSpeech can be used for inference, but it could also be built fully to generate the binaries. I didn’t do this, so I’m not sure if that is causing a problem.

Thanks for an amazing project!

Looks like a generate_trie mismatch. Care to share the ./deepspeech -h output?

Here is the output of deepspeech -h run from the DeepSpeech directory:

usage: deepspeech [-h] model audio alphabet [lm] [trie]

Benchmarking tooling for DeepSpeech native_client.

positional arguments:
  model       Path to the model (protocol buffer binary file)
  audio       Path to the audio file to run (WAV format)
  alphabet    Path to the configuration file specifying the alphabet used by
              the network
  lm          Path to the language model binary file
  trie        Path to the language model trie file created with
              native_client/generate_trie

optional arguments:
  -h, --help  show this help message and exit

Let me know what I screwed up :wink:

That looks like 0.1.1, doesn’t it?

Yes, when I install deepspeech it shows 0.1.1 as the installed version:

deepspeech in /usr/local/lib/python3.6/dist-packages (0.1.1)

So you are mixing 0.1.1 and master? It’s not going to work.

Aaaah, ok. Hmm, so I installed DeepSpeech 0.1.1 but the native client from master. Is there a recommended version, master or 0.1.1?

I don’t see a method in the documentation for installing a specific version of deepspeech. I’m obviously missing something.

Just make sure you use the same versions for everything. If you use master, you need to use the LM and trie from Git LFS, not the ones in models.tar.gz from 0.1.1.

We now have tooling in util/taskcluster.py to specify a version. If you download from PyPI or npm, you can also use the alpha releases that have been pushed there.

Sorry for the long delay; I’ve been away on vacation. I’m attempting to install v0.1.1 of the native client with the following command:

/content/DeepSpeech/util/taskcluster.py --branch "v0.1.1" --target /content/DeepSpeech/native_client
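For what it’s worth, taskcluster.py appears to just interpolate the branch name into a fixed index URL. A minimal sketch of that interpolation (the template is reconstructed from the error output below, so treat it as an assumption about the script’s internals):

```python
def native_client_url(branch, arch="cpu",
                      artifact="native_client.tar.xz"):
    """Build the TaskCluster index URL for a native_client artifact."""
    return ("https://index.taskcluster.net/v1/task/"
            "project.deepspeech.deepspeech.native_client."
            "%s.%s/artifacts/public/%s" % (branch, arch, artifact))

# A 404 on this URL means nothing was ever indexed under that branch
# or tag name, so the version string itself may simply not exist there.
```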

When doing so I’m getting a URL error. Here is the full output:

Downloading https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.cpu/artifacts/public/native_client.tar.xz ...
Traceback (most recent call last):
  File "/content/DeepSpeech/util/taskcluster.py", line 87, in <module>
    maybe_download_tc(target_dir=args.target, tc_url=get_tc_url(args.arch, args.artifact, args.branch))
  File "/content/DeepSpeech/util/taskcluster.py", line 51, in maybe_download_tc
    urllib.request.urlretrieve(tc_url, target_file, reporthook=(report_progress if progress else None))
  File "/usr/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

I seem to be passing the wrong version number. I tried various combinations but am pretty much just guessing (0.1.1, 0.1.1-gpu, etc.).

I don’t see a list of version numbers in the documentation, however.

Also, I should say I’m not trying to use the pre-trained models or to do transfer learning. We are testing out a small data set in Mongolian. Thanks!

Hi @robertritz,
have you had any luck generating the language model?
I hope you did.
I am trying to train Mozilla DeepSpeech on Mongolian, and it seems I need to create the language model (which I am having problems with).
Can you help me?

I did have luck on a small dataset, but I haven’t tried scaling it up since then. If you are interested, there is a premade Mongolian language model created specifically for use with DeepSpeech. Link

MODEL 5-gram binary LM generated by KenLM on a 670M word dirty corpus.

@robertritz
Thanks. I have successfully generated words.arpa & lm.binary using KenLM.

I am currently trying to generate the trie:

[screenshot of the generate_trie error]

but it seems something is wrong.

Please avoid posting screenshots.

Your data is wrong.

I did not know I should avoid posting screenshots (I’ll keep it in mind).
I have the .csv files converted into one .txt file containing only the transcripts
(the wav_filename and wav_filesize columns and the headers all deleted).
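That conversion can be sketched like this, assuming DeepSpeech-style CSVs with wav_filename, wav_filesize and transcript columns (the column names and paths here are assumptions, not your exact script):

```python
import csv
import glob

def write_transcript_corpus(csv_pattern, out_path, column="transcript"):
    """Concatenate the transcript column of DeepSpeech-style CSVs into
    one plain-text file, one sentence per line, headers dropped."""
    with open(out_path, "w", encoding="utf-8") as out:
        for csv_path in sorted(glob.glob(csv_pattern)):
            with open(csv_path, encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    text = row[column].strip()
                    if text:
                        out.write(text + "\n")
```

This keeps only the transcript text, so no wav_filename or header tokens can leak into the LM vocabulary.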

If you want, you can check out the Google Colab link I provided earlier.
I’ll double-check tomorrow to make sure I did not upload the wrong lm.binary to my Google Drive (which is what the generate_trie command is using).

Well, what I can read from your error is exactly that your LM or trie file contains CSV headers …

I am trying to generate the trie file.

As for the rest, I will do it once again and share the results here :slight_smile:

@lissyx, many issues folks face seem to be version mismatches one way or another. Are there any plans to streamline things?