How can I add a custom vocab.txt and build a language model (lm.binary, trie) for the pretrained model v0.2.0?

Sir, I took the pretrained model's vocab.txt (data/lm/vocab.txt) and added my own vocab.txt in the same format, then started to build the LM and the trie.

Result: it throws an error.

../../new_native_client/kenlm/build/bin/lmplz -o 5 <vocab.txt >lm.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/dell/Music/12-09-2018/DeepSpeech/data/own_lmm/vocab.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Unigram tokens 974571 types 973693
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11684316 2:641120832 3:1202101632 4:1923362432 5:2804903936
/home/dell/Music/12-09-2018/DeepSpeech/new_native_client/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 3 because we didn't observe any 1-grams with adjusted count 2; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback

../../new_native_client/kenlm/build/bin/build_binary lm.arpa lm.binary
Reading lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
End of file Byte: 0
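Side note: the near-equal token and type counts above (974571 vs 973693) suggest vocab.txt is close to a one-word-per-line list, whereas lmplz expects running text. A minimal sketch of the two workarounds the error message itself offers, assuming the same paths:

# deduplicate the input, as the message suggests:
sort vocab.txt | uniq > vocab_dedup.txt
../../new_native_client/kenlm/build/bin/lmplz -o 5 <vocab_dedup.txt >lm.arpa

# or override the Kneser-Ney discount failure for small/artificial data:
../../new_native_client/kenlm/build/bin/lmplz -o 5 --discount_fallback <vocab.txt >lm.arpa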

I searched for this same issue, but the solution was not clear.

How can I combine the vocab files and build a custom LM, sir?

Can you document exactly your steps? Are you following the data/lm/README.md documentation? I did it no later than yesterday, no problem at all.

No sir, I am following that README.md, but it only downloads librispeech-lm-norm.txt (4.3 GB; 4,287,216,164 bytes), converts it to lower case, and builds a language model from it. If I add my own vocab.txt, it is very complex. How can I add it?

Next I am following the instructions in this tutorial:

TUTORIAL : How I trained a specific french model to control my robot

but it throws the error shown above.

Then be more clear in describing what you do. You need to follow the steps documented in data/lm/README.md, but adjust them to add your own data. You will still need the documented data, however.

Sorry sir, I can't understand what you said.

Adding vocab.txt + my own vocab.txt (data/own_lm):

Reference:

Build the language model:

(deepspeech-alpha08) dell@dell-OptiPlex-5050:~/Music/12-09-2018/DeepSpeech$ cd new_native_client/

step 1: git clone https://github.com/kpu/kenlm.git
step 2: cd kenlm
mkdir build
cd build
cmake ..
step 3:

-- To install Eigen3 in your home directory, copy paste this:
cd $HOME
wget -O - https://bitbucket.org/eigen/eigen/get/3.2.8.tar.bz2 | tar xj
rm CMakeCache.txt
export EIGEN3_ROOT=$HOME/eigen-eigen-07105f7124f9

-- Configuring done
-- Generating done
-- Build files have been written to: /home/dell/Music/12-09-2018/DeepSpeech/new_native_client/kenlm/build

step 4: make -j 4
step 5: copy and paste alphabet.txt (data/own_lm)
step 6: ../../new_native_client/kenlm/build/bin/lmplz -o 5 <vocab.txt >lm.arpa
step 7: ../../new_native_client/kenlm/build/bin/build_binary lm.arpa lm.binary (see the sanity check after these steps)
step 8: ../../new_native_client/generate_trie alphabet.txt lm.binary vocab.txt trie
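A quick sanity check between steps 6 and 7 (this is where the "End of file Byte: 0" above comes from: build_binary was fed the empty lm.arpa that the failed lmplz run left behind):

head -n 5 lm.arpa
# a valid ARPA file starts with "\data\" followed by "ngram 1=..." counts;
# if the file is empty, fix and rerun step 6 before running build_binary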

I followed these instructions. Previously they worked fine (with vocab.txt from pretrained v0.1.0); that is the only reason I am following this tutorial, sir.

Thank you so much for your reply, sir.

Well, sorry, but at that point, I cannot do it for you. You need to combine lower.txt with your own words, and use the resulting file to produce a new language model, as documented.
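A minimal sketch of that combining step, assuming lower.txt is the lowercased LibriSpeech text from the README and my_vocab.txt (a hypothetical name) holds your own lines, already lowercased:

cat lower.txt my_vocab.txt > combined.txt
../../new_native_client/kenlm/build/bin/lmplz -o 5 <combined.txt >lm.arpa
../../new_native_client/kenlm/build/bin/build_binary lm.arpa lm.binary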

Yes, really sir.

need to combine lower.txt with your own words

Thank you sir, I will do it.

@lissyx sir

Sir, what can I do about the numbers present in my large vocab.txt? My alphabet.txt contains only (a-z, ').

How can I add numbers and some special characters to my alphabet.txt? If we add numbers to alphabet.txt, it is very complex. How can I handle the numbers in my large vocab.txt (15 MB)?

I am having trouble handling a large vocab.txt to build a language model. Thank you sir.

For numbers, so far, we just transform them to text, like 20 = twenty.

But if we have a lot of numbers in a txt file, it is very complex. Sir, please fix this issue: DeepSpeech works very well, but this kind of issue reduces people's interest. Thank you sir.

Hi @muruganrajenthirean.
I saw, on the web, a Python file to automate the process of replacing numbers with the corresponding text (as Lissyx said).
Searching for it for you… wait

As @elpimous_robot said, search; there is Python code to do that.

Here: https://github.com/savoirfairelinux/num2words
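A quick sketch of what that library does, calling its Python API from the shell (output shown as a comment):

pip install num2words
python -c "from num2words import num2words; print(num2words(20))"
# twenty

Applied to a corpus, the idea is to replace every digit sequence with its num2words expansion before feeding the text to lmplz, so the LM only contains characters that exist in alphabet.txt.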

@elpimous_robot, @lissyx thank you so much, sir.

Hi @lissyx , @kdavis , @elpimous_robot

I am following the how-i-trained-a-specific-french-model tutorial, using /bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3 and /bin/bin/./build_binary -T -s words.arpa lm.binary.

I was able to generate words.arpa and lm.binary. Also, my vocabulary.txt contains all the text in UPPER CASE.

In order to generate these (.arpa, .binary) files, it was advised to follow the README.md at https://github.com/mozilla/DeepSpeech/tree/master/data/lm

Doubts:

If I am able to generate the .arpa and .binary files with the above lmplz and build_binary commands from the French tutorial, using the UPPER CASE text in my vocabulary.txt,

then, regarding data/lm/README.md:

a.) Why are we converting the whole text to lowercase? (Given that I was able to generate a .binary file from my UPPER CASE text in vocabulary.txt.)

b.) Is it compulsory to convert the UPPER CASE text to lower case? Will there be any repercussions going forward if the upper case text in vocabulary.txt is not converted into lower case?

c.) Also, could you please explain --memory and --prune, as used below?

!lmplz --order 5 \
--temp_prefix /tmp/ \
--memory 50% \
--text {data_lower} \
--arpa {lm_path} \
--prune 0 0 0 1
d.) And could you please explain -a 255 and -q 8, as used below?

!build_binary -a 255 \
-q 8 \
trie \
{lm_path} \
{binary_path}

I checked the official kheafield.com documentation but couldn't make sense of these parameters.

Lowercasing is a normalization step, because in pronunciation there's no difference between lower and uppercase letters.
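A one-line sketch of that normalization for a plain-text corpus (the pretrained model's alphabet.txt contains only lowercase letters and the apostrophe, so UPPER CASE tokens in the LM could never match the acoustic model's output):

tr '[:upper:]' '[:lower:]' < vocabulary.txt > vocabulary_lower.txt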

Hi dvz,

Thanks for the reply.

Could you please also reply to queries c.) and d.)? I fail to understand what these params are doing.

Read the KenLM doc, it’s explained there.
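For quick reference, a rough gloss of those parameters, paraphrasing the KenLM documentation (treat the docs themselves as authoritative):

# --temp_prefix /tmp/ : where lmplz writes its on-disk sorting scratch files
# --memory 50%        : cap the RAM used for sorting at half of the machine's memory
# --prune 0 0 0 1     : keep all 1-, 2- and 3-grams; drop 4-grams (and above) seen only once
# -a 255              : maximum bits for pointer-array compression in the trie;
#                       255 lets build_binary minimize memory use
# -q 8                : store probabilities quantized to 8 bits instead of full floats,
#                       shrinking the file at a small accuracy cost
# trie                : selects the trie data structure (smaller than the default probing one)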
