Language Model Creation

I’m trying to create a trie and am encountering the following error:

terminate called after throwing an instance of 'lm::FormatLoadException'
  what():  native_client/kenlm/lm/binary_format.cc:131 in void lm::ngram::MatchCheck(lm::ngram::ModelType, unsigned int, const lm::ngram::Parameters&) threw FormatLoadException.
The binary file was built for probing hash tables but the inference code is trying to load trie with quantization and array-compressed pointers
Aborted (core dumped)

I am following the tutorial here, which was used to make a French model. I was able to make the binary with the following KenLM commands:

/lmplz --text vocabulary.txt --arpa words.arpa --o 3
/build_binary -T -s words.arpa language.binary

I’m attempting to make the trie as follows:

native_client/generate_trie alphabet.txt language.binary vocabulary.txt trie

According to this GitHub issue, this may be caused by the switch in the language model tooling. I changed my binary-generation command by adding "trie" for [type], as recommended in the issue.

/build_binary -T -s trie words.arpa language.binary
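For reference, the two KenLM steps with the explicit trie type can be sketched from Python like this. The `bin/` prefix and file names are assumptions about the KenLM checkout layout, and `build_binary` is assumed to default to the probing hash-table format when no type is given, which would explain the original error:

```python
import subprocess

def kenlm_commands(text="vocabulary.txt", arpa="words.arpa",
                   binary="language.binary", order=3):
    """Return the two KenLM invocations as argv lists.

    Passing the explicit "trie" type matters: without it,
    build_binary produces a probing hash table, which is the
    mismatch the FormatLoadException complains about.
    """
    lmplz = ["bin/lmplz", "-o", str(order),
             "--text", text, "--arpa", arpa]
    build_binary = ["bin/build_binary", "trie", arpa, binary]
    return lmplz, build_binary

# To actually run them (requires a built KenLM in bin/):
# for cmd in kenlm_commands():
#     subprocess.run(cmd, check=True)
```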

That didn’t seem to make any change. I am using the native client master downloaded from TaskCluster (from https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.master.cpu/artifacts/public/native_client.tar.xz).
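If rebuilding with the trie type really made no difference, one thing worth ruling out is that the decoder is still reading a stale copy of the old probing-format file. A minimal checksum-comparison sketch (the paths in the comment are placeholders):

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large LM binaries are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# If the digest is unchanged after re-running build_binary with the
# "trie" type, the decoder is still loading the old file, e.g.:
# print(file_digest("language.binary"))
```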

I think this is a simple flag issue in creating the binary but I’m not sure what to change.

I should also say I’m using a KenLM build that is separate from my DeepSpeech installation. It seems the KenLM bundled with DeepSpeech can be used for inference, but it could also be built fully to generate the binaries. I didn’t do this, so I’m not sure if that is causing a problem.

Thanks for an amazing project!

Looks like a generate_trie mismatch. Care to share the ./deepspeech -h output?

Here is the output of deepspeech -h run from the DeepSpeech directory:

usage: deepspeech [-h] model audio alphabet [lm] [trie]

Benchmarking tooling for DeepSpeech native_client.

positional arguments:
  model       Path to the model (protocol buffer binary file)
  audio       Path to the audio file to run (WAV format)
  alphabet    Path to the configuration file specifying the alphabet used by
              the network
  lm          Path to the language model binary file
  trie        Path to the language model trie file created with
              native_client/generate_trie

optional arguments:
  -h, --help  show this help message and exit

Let me know what I screwed up :wink:

That looks like 0.1.1, doesn’t it?

Yes, when I install deepspeech it shows 0.1.1 as the installed version:

deepspeech in /usr/local/lib/python3.6/dist-packages (0.1.1)

So you are mixing 0.1.1 and master? It’s not going to work.

Aaaah, ok. Hmm, so I installed DeepSpeech 0.1.1 but the native client from master. Is there a recommended version, master or 0.1.1?

I don’t see a method in the documentation for installing a specific version of deepspeech. I’m obviously missing something.

Just make sure you use the same versions for everything. If you use master, you need to use the LM and trie from Git LFS, not the ones in models.tar.gz from 0.1.1.

We now have tooling in util/taskcluster.py to specify a version. If you download from PyPI or npm, you can also use the alpha releases that have been pushed there.

Sorry for the long delay; I’ve been away on vacation. I’m attempting to install v0.1.1 of the native client with the following command:

/content/DeepSpeech/util/taskcluster.py --branch "v0.1.1" --target /content/DeepSpeech/native_client
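For what it’s worth, taskcluster.py appears to just interpolate the branch name into a fixed index URL. A minimal sketch of that interpolation (the template is reconstructed from the error output below, so treat it as an assumption about the script’s internals):

```python
def native_client_url(branch, arch="cpu",
                      artifact="native_client.tar.xz"):
    """Build the TaskCluster index URL for a native_client artifact."""
    return ("https://index.taskcluster.net/v1/task/"
            "project.deepspeech.deepspeech.native_client."
            "%s.%s/artifacts/public/%s" % (branch, arch, artifact))

# A 404 on this URL means nothing was ever indexed under that branch
# or tag name, so the version string itself may simply not exist there.
```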

When doing so I’m getting a URL error. Here is the full output:

Downloading https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.cpu/artifacts/public/native_client.tar.xz ...
Traceback (most recent call last):
  File "/content/DeepSpeech/util/taskcluster.py", line 87, in <module>
    maybe_download_tc(target_dir=args.target, tc_url=get_tc_url(args.arch, args.artifact, args.branch))
  File "/content/DeepSpeech/util/taskcluster.py", line 51, in maybe_download_tc
    urllib.request.urlretrieve(tc_url, target_file, reporthook=(report_progress if progress else None))
  File "/usr/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

I seem to be passing the wrong version number. I tried various combinations but am pretty much just guessing (0.1.1, 0.1.1-gpu, etc.).

I don’t see a list of version numbers in the documentation, however.

Also, I should say I’m not trying to use the pre-trained models or to do transfer learning. We are testing out a small data set in Mongolian. Thanks!

Hi @robertritz,
have you had any luck generating the language model?
I hope you did.
I am trying to train Mozilla DeepSpeech on Mongolian, and it seems I need to create the language model (which I am having problems with).
Can you help me?

I did have luck on a small dataset, but I haven’t tried scaling it up since then. If you are interested, there is a premade Mongolian language model created specifically for use with DeepSpeech. Link

MODEL 5-gram binary LM generated by KenLM on a 670M word dirty corpus.

@robertritz
Thanks. I have successfully generated words.arpa & lm.binary using KenLM.

I am currently trying to generate the trie:

[screenshot of the generate_trie error]

but it seems something is wrong.

Please avoid posting screenshots.

Your data is wrong.

I did not know I should avoid posting screenshots (I’ll keep it in mind).
I have the .csv files converted into one .txt file containing only the transcripts
(the wav_filename and wav_filesize columns and the headers all deleted).
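That conversion can be sketched like this, assuming DeepSpeech-style CSVs with wav_filename, wav_filesize and transcript columns (the column names and paths here are assumptions, not your exact script):

```python
import csv
import glob

def write_transcript_corpus(csv_pattern, out_path, column="transcript"):
    """Concatenate the transcript column of DeepSpeech-style CSVs into
    one plain-text file, one sentence per line, headers dropped."""
    with open(out_path, "w", encoding="utf-8") as out:
        for csv_path in sorted(glob.glob(csv_pattern)):
            with open(csv_path, encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    text = row[column].strip()
                    if text:
                        out.write(text + "\n")
```

This keeps only the transcript text, so no wav_filename or header tokens can leak into the LM vocabulary.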

If you want, you can check out the Google Colab link I provided earlier.
I’ll double-check tomorrow to make sure I did not upload the wrong lm.binary to my Google Drive (which is what the generate_trie command is using).

Well, what I can read from your error is exactly that your LM or trie file contains CSV headers …

I am trying to generate the trie file.

As for the rest, I will do it once again and share the results here :slight_smile:

@lissyx, many issues folks face seem to be version mismatches one way or another. Are there any plans to streamline things?