How can I add a custom vocab.txt and build a language model (lm.binary, trie) for the pretrained model v0.2.0?

Sir, I took the pretrained model's vocab.txt (data/lm/vocab.txt) and added my own vocab.txt in the same format, then started to build the LM and the trie.

Result: it throws an error.

../../new_native_client/kenlm/build/bin/lmplz -o 5 <vocab.txt >lm.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/dell/Music/12-09-2018/DeepSpeech/data/own_lmm/vocab.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Unigram tokens 974571 types 973693
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11684316 2:641120832 3:1202101632 4:1923362432 5:2804903936
/home/dell/Music/12-09-2018/DeepSpeech/new_native_client/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 3 because we didn't observe any 1-grams with adjusted count 2; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback

../../new_native_client/kenlm/build/bin/build_binary lm.arpa lm.binary
Reading lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
End of file Byte: 0
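Side note: the near-equal token and type counts above (974571 vs 973693) suggest vocab.txt is close to a one-word-per-line list, whereas lmplz expects running text. A minimal sketch of the two workarounds the error message itself offers, assuming the same paths:

# deduplicate the input, as the message suggests:
sort vocab.txt | uniq > vocab_dedup.txt
../../new_native_client/kenlm/build/bin/lmplz -o 5 <vocab_dedup.txt >lm.arpa

# or override the Kneser-Ney discount failure for small/artificial data:
../../new_native_client/kenlm/build/bin/lmplz -o 5 --discount_fallback <vocab.txt >lm.arpa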

I searched for this same issue, but the solution was not clear.

How can I combine the vocab files and build a custom LM, sir?

Can you document exactly your steps? Are you following the data/lm/README.md documentation? I did it no later than yesterday, no problem at all.

No sir, I am following that README.md, but it only downloads librispeech-lm-norm.txt (4.3 GB; 4,287,216,164 bytes), converts it to lower case, and builds a language model from it. If I add my own vocab.txt, it is very complex. How can I add it?

Next I am following the instructions in this tutorial:

TUTORIAL : How I trained a specific french model to control my robot

but it throws the error shown above.

Then be more clear in describing what you do. You need to follow the steps documented in data/lm/README.md, but adjust them to add your own data. You will still need the documented data, however.

Sorry sir, I can't understand what you said.

Adding vocab.txt + my own vocab.txt (data/own_lm):

Reference:

Build the language model:

(deepspeech-alpha08) dell@dell-OptiPlex-5050:~/Music/12-09-2018/DeepSpeech$ cd new_native_client/

step 1: git clone https://github.com/kpu/kenlm.git
step 2: cd kenlm
mkdir build
cd build
cmake ..
step 3:

-- To install Eigen3 in your home directory, copy paste this:
cd $HOME
wget -O - https://bitbucket.org/eigen/eigen/get/3.2.8.tar.bz2 | tar xj
rm CMakeCache.txt
export EIGEN3_ROOT=$HOME/eigen-eigen-07105f7124f9

-- Configuring done
-- Generating done
-- Build files have been written to: /home/dell/Music/12-09-2018/DeepSpeech/new_native_client/kenlm/build

step 4: make -j 4
step 5: copy and paste alphabet.txt (data/own_lm)
step 6: ../../new_native_client/kenlm/build/bin/lmplz -o 5 <vocab.txt >lm.arpa
step 7: ../../new_native_client/kenlm/build/bin/build_binary lm.arpa lm.binary (see the sanity check after these steps)
step 8: ../../new_native_client/generate_trie alphabet.txt lm.binary vocab.txt trie
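A quick sanity check between steps 6 and 7 (this is where the "End of file Byte: 0" above comes from: build_binary was fed the empty lm.arpa that the failed lmplz run left behind):

head -n 5 lm.arpa
# a valid ARPA file starts with "\data\" followed by "ngram 1=..." counts;
# if the file is empty, fix and rerun step 6 before running build_binary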

I followed these instructions. Previously they worked fine (with vocab.txt from pretrained v0.1.0); that is the only reason I am following this tutorial, sir.

Thank you so much for your reply, sir.

Well, sorry, but at that point, I cannot do it for you. You need to combine lower.txt with your own words, and use the resulting file to produce a new language model, as documented.
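A minimal sketch of that combining step, assuming lower.txt is the lowercased LibriSpeech text from the README and my_vocab.txt (a hypothetical name) holds your own lines, already lowercased:

cat lower.txt my_vocab.txt > combined.txt
../../new_native_client/kenlm/build/bin/lmplz -o 5 <combined.txt >lm.arpa
../../new_native_client/kenlm/build/bin/build_binary lm.arpa lm.binary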

Yes, really sir.

need to combine lower.txt with your own words

Thank you sir, I will do it.

@lissyx sir

Sir, what can I do about the numbers present in my large vocab.txt? My alphabet.txt contains only (a-z, ').

How can I add numbers and some special characters to my alphabet.txt? If we add numbers to alphabet.txt, it is very complex. How can I handle the numbers in my large vocab.txt (15 MB)?

I am having trouble handling a large vocab.txt to build a language model. Thank you sir.

For numbers, so far, we just transform them to text, like 20 = twenty.

But if we have a lot of numbers in a txt file, it is very complex. Sir, please fix this issue: DeepSpeech works very well, but this kind of issue reduces people's interest. Thank you sir.

Hi @muruganrajenthirean.
I saw, on the web, a Python file to automate the process of replacing numbers with the corresponding text (as Lissyx said).
Searching for it for you… wait

As @elpimous_robot said, search; there is Python code to do that.

Here: https://github.com/savoirfairelinux/num2words
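A quick sketch of what that library does, calling its Python API from the shell (output shown as a comment):

pip install num2words
python -c "from num2words import num2words; print(num2words(20))"
# twenty

Applied to a corpus, the idea is to replace every digit sequence with its num2words expansion before feeding the text to lmplz, so the LM only contains characters that exist in alphabet.txt.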

@elpimous_robot, @lissyx thank you so much, sir.

Hi @lissyx , @kdavis , @elpimous_robot

I am following the how-i-trained-a-specific-french-model tutorial, using /bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3 and /bin/bin/./build_binary -T -s words.arpa lm.binary.

I was able to generate words.arpa and lm.binary. Also, my vocabulary.txt contains all the text in UPPER CASE.

In order to generate these (.arpa, .binary) files, it was advised to follow the README.md at https://github.com/mozilla/DeepSpeech/tree/master/data/lm

Doubts:

If I am able to generate the .arpa and .binary files with the above lmplz and build_binary commands from the French tutorial, using the UPPER CASE text in my vocabulary.txt,

then, regarding data/lm/README.md:

a.) Why are we converting the whole text to lowercase? (Given that I was able to generate a .binary file from my UPPER CASE text in vocabulary.txt.)

b.) Is it compulsory to convert the UPPER CASE text to lower case? Will there be any repercussions going forward if the upper case text in vocabulary.txt is not converted into lower case?

c.) Also, could you please explain --memory and --prune, as used below?

!lmplz --order 5 \
--temp_prefix /tmp/ \
--memory 50% \
--text {data_lower} \
--arpa {lm_path} \
--prune 0 0 0 1
d.) And could you please explain -a 255 and -q 8, as used below?

!build_binary -a 255 \
-q 8 \
trie \
{lm_path} \
{binary_path}

I checked the official kheafield.com documentation but couldn't make sense of these parameters.

Lowercasing is a normalization step, because in pronunciation there's no difference between lower and uppercase letters.
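A one-line sketch of that normalization for a plain-text corpus (the pretrained model's alphabet.txt contains only lowercase letters and the apostrophe, so UPPER CASE tokens in the LM could never match the acoustic model's output):

tr '[:upper:]' '[:lower:]' < vocabulary.txt > vocabulary_lower.txt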

Hi dvz,

Thanks for the reply.

Could you please also reply to queries c.) and d.)? I fail to understand what these params are doing.

Read the KenLM doc, it’s explained there.
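For quick reference, a rough gloss of those parameters, paraphrasing the KenLM documentation (treat the docs themselves as authoritative):

# --temp_prefix /tmp/ : where lmplz writes its on-disk sorting scratch files
# --memory 50%        : cap the RAM used for sorting at half of the machine's memory
# --prune 0 0 0 1     : keep all 1-, 2- and 3-grams; drop 4-grams (and above) seen only once
# -a 255              : maximum bits for pointer-array compression in the trie;
#                       255 lets build_binary minimize memory use
# -q 8                : store probabilities quantized to 8 bits instead of full floats,
#                       shrinking the file at a small accuracy cost
# trie                : selects the trie data structure (smaller than the default probing one)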
