Generating trie

Is there an easy way to generate a trie for a custom LM rather than building it from the native client binaries?

Please download native_client.tar.xz from the releases page, it contains all you need.

I don't find generate_trie in native_client.tar.xz.
I can see the following files in native_client.tar.xz:
LICENSE README.mozilla deepspeech deepspeech.h libdeepspeech.so

Are you looking at 0.6.1 release files?

Yes, I am looking at the 0.6.1 release.

Then look at the files for the 0.6.1 release, it's there.

There are prebuilt versions of the native client. If you cloned the repo, you can run

python3 util/taskcluster.py --target native_client

and it should download it for you. Afterwards you'll have a generate_trie binary.


Thanks @lissyx, I generated the trie for my custom LM.


Hello! I ran python3 util/taskcluster.py --target native_client and generate_trie is still not around :frowning: I've been stuck at this step for days… any ideas? I am using the 0.7.0-alpha.2 version.

I know this can be confusing, just ask more often, you are giving the right information :slight_smile: The trie was replaced by the scorer for the 0.7 release. Build the lm.arpa and lm.binary as before, then run generate_package.py with the binary, the vocabulary txt file and the alphabet as inputs:
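A minimal sketch of that call, using the flags that appear later in this thread; the paths and the alpha/beta values here are placeholders, not recommended settings:

python generate_package.py --alphabet /path/to/alphabet.txt --lm /path/to/lm.binary --vocab /path/to/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer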


Thanks for the fast reply! I already tried to run generate_package.py, as mentioned in the readme, but it gave me the following error:
Traceback (most recent call last):
File "generate_package.py", line 15, in <module>
from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet
ImportError: No module named ds_ctcdecoder

Ah, you have to get the native_client and the decoder, try:

python3 util/taskcluster.py --decoder

to download it.

Download the native client from https://github.com/mozilla/DeepSpeech/releases. You will find generate_trie once you untar it. Then pass alphabet.txt, lm.binary and the path where the trie should be saved to generate_trie, for example:
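A rough sketch of that call, assuming the 0.6.x generate_trie which takes these as positional arguments in this order; adjust the paths to your own files:

./generate_trie /path/to/alphabet.txt /path/to/lm.binary /path/to/trie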

Where should this decoder be located? I placed it in DeepSpeech/data/lm, and now I have a brand new error haha!
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Error: Can't parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer

Sorry, so you run this in the DeepSpeech main folder:

pip install $(python3 util/taskcluster.py --decoder)

It downloads the wheel and installs it into the virtualenv you are running.
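To confirm the wheel ended up in the virtualenv you are using, a quick sanity check (the import is the same one from the traceback above):

python3 -c "from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet"

If that exits without an ImportError, generate_package.py should get past that step.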


Thank you a lot! Sorry for asking so many things, but it's my first time working with something this big. I am working on my bachelor's degree, so I'm pretty much a noob :smiley:.

Hello! I am still stuck at this error:
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Error: Can't parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
This appears when I execute this:

`python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer`

I trained a Chinese model with v0.7.0-alpha3 and also faced this problem :rofl:

root@0cb4d86eab66:/Other_version/DeepSpeech/data/lm# python generate_package.py --lm /DeepSpeech/data/lm/lm.binary --vocab /DeepSpeech/data/all/alphabet.txt --package lm.scorer --default_alpha 0.75 --default_beta 1.85

6557 unique words read from vocabulary file.
Looks like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in lm.scorer

@Andreea_Georgiana_Sarca How did you build the lm.binary file? Did you use all the arguments set here? The command itself looks fine.

Yeah, well, I was not using all the arguments.
So I tried again:
/home/andreea/kenlm/build/bin/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
/home/andreea/kenlm/build/bin/build_binary -a 255 -q 8 -v -trie words.arpa lm.binary

And for this last command I got: Quantization is only implemented in the trie data structure.

When I ran generate_package.py again as mentioned before, I got exactly the same thing…
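For what it's worth, the "Quantization is only implemented in the trie data structure" message usually means build_binary did not build a trie at all: in KenLM the data structure name is a positional argument, not a -trie flag, so -q 8 was being applied to the default probing structure. A sketch of the corrected call, reusing the paths from the post above:

/home/andreea/kenlm/build/bin/build_binary -a 255 -q 8 -v trie words.arpa lm.binary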