Language Model Creation

So you are mixing 0.1.1 and master? It’s not going to work.

Aaaah, ok. Hmm, so I installed DeepSpeech 0.1.1 but the native client for master. Is there a recommended version, master or 0.1.1?

I don’t see a way to install a specific version of DeepSpeech in the documentation. I’m obviously missing something.

Just make sure you use the same versions for everything. If you use master, you need to use the LM and trie from Git LFS, and not the ones in models.tar.gz from 0.1.1.
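For reference, pulling the LFS-tracked files from a clone of the repository is roughly the following (a sketch only; the in-repo paths of the LM and trie can differ between versions):

git lfs install   # one-time setup of the Git LFS hooks
git lfs pull      # fetches LFS-tracked artifacts such as the LM and trie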

We now have tooling in util/taskcluster.py to specify a version. If you download from PyPI or npm, you can also use the alpha releases that have been pushed there.
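As a rough illustration, pinning a release from the package managers looks like this (the version numbers are only examples, not a recommendation):

pip install deepspeech==0.1.1    # Python bindings from PyPI; alpha releases can be pinned the same way
npm install deepspeech@0.1.1     # Node bindings from npm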

Sorry for the long delay. I’ve been away on vacation. I’m attempting to install v0.1.1 of the native client via the following command:

/content/DeepSpeech/util/taskcluster.py --branch "v0.1.1" --target /content/DeepSpeech/native_client

When doing so I’m getting a URL error. Here is the full output:

Downloading https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.1.1.cpu/artifacts/public/native_client.tar.xz ...
Traceback (most recent call last):
  File "/content/DeepSpeech/util/taskcluster.py", line 87, in <module>
    maybe_download_tc(target_dir=args.target, tc_url=get_tc_url(args.arch, args.artifact, args.branch))
  File "/content/DeepSpeech/util/taskcluster.py", line 51, in maybe_download_tc
    urllib.request.urlretrieve(tc_url, target_file, reporthook=(report_progress if progress else None))
  File "/usr/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

I seem to be passing the wrong version number. I tried various combinations but am pretty much just guessing (0.1.1, 0.1.1-gpu, etc.).

However, I don’t see a list of version numbers in the documentation.
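One way to see which tagged versions exist, independent of the documentation, is to list the release tags on the repository (a suggestion, not an official procedure):

git ls-remote --tags https://github.com/mozilla/DeepSpeech.git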

Also, I should say I’m not trying to use the pre-trained models or to do transfer learning. We are testing out a small dataset in Mongolian. Thanks!

Hi @robertritz,
have you had any luck generating the language model? I hope you did.
I am trying to train Mozilla DeepSpeech on the Mongolian language, and it seems I need to create the language model (which I am having problems with).
Can you help me?

I did have luck on a small dataset, but haven’t tried scaling it up since then. If you are interested, there is a premade Mongolian language model created specifically for use with DeepSpeech. Link

MODEL: 5-gram binary LM generated by KenLM on a 670M-word dirty corpus.

@robertritz
Thanks. I have successfully generated words.arpa and lm.binary using KenLM.
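For anyone following along, the usual KenLM steps look roughly like this; corpus.txt is a placeholder for the plain-text transcript corpus, and the 5-gram order is only an example:

lmplz -o 5 < corpus.txt > words.arpa    # estimate a 5-gram ARPA language model
build_binary words.arpa lm.binary       # convert the ARPA file to KenLM's binary format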

I’m currently trying to generate the trie, but it seems something is wrong.
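For reference, a typical invocation looks roughly like this; the file names are placeholders, and the argument order of generate_trie has changed between DeepSpeech releases, so the usage output of your own binary is authoritative:

./generate_trie alphabet.txt lm.binary vocabulary.txt trie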

Please avoid posting screenshots.

Your data is wrong.

I did not know I should avoid posting screenshots (I’ll keep it in mind).
I have the .csv files converted into one .txt file containing only the transcripts
(the wav_filename and wav_filesize columns and the header row all deleted).
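In case it helps, a minimal way to strip a DeepSpeech-style CSV down to a transcript-only corpus is sketched below; train.csv is a placeholder, and it assumes the transcript is the third column and contains no commas:

tail -n +2 train.csv | cut -d',' -f3 > corpus.txt    # drop the header row, keep only the transcript column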

If you want, you can check out the Google Colab link I’ve provided earlier.
I’ll double-check again tomorrow to make sure I did not upload the wrong lm.binary to my Google Drive (which is the one used in the DeepSpeech generate_trie command).

Well, what I can read from your error is exactly that your LM or trie file contains CSV headers …

I am trying to generate the trie file.

as for the

I will do it once again and share the results here. :)

@lissyx, many issues folks face seem to be version mismatches one way or another. Are there any plans to streamline things?

Streamline what exactly?

Similar to the change we did in taskcluster.py: ideally, one should not have to juggle different versions of files manually.

I don’t see the link with the current thread; can you elaborate?

I see, I’ll try to come back with a more concrete example. Thanks for the great work. We have about 80% accuracy thanks to you. :)


It is fixed. The Google Drive import had some problems… (bow)
words.arpa, lm.binary, and trie were all generated successfully. Hooray!
