Unigram tokens 974571 types 973693
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11684316 2:641120832 3:1202101632 4:1923362432 5:2804903936
/home/dell/Music/12-09-2018/DeepSpeech/new_native_client/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 3 because we didn't observe any 1-grams with adjusted count 2; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback
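The error message above names its own two workarounds: deduplicate the training text, or rerun with --discount_fallback. A minimal sketch, assuming the training text is vocab.txt and order 5 as elsewhere in this thread:

sort -u vocab.txt > vocab.dedup.txt    # drop duplicate lines from the training text
lmplz --order 5 --text vocab.dedup.txt --arpa lm.arpa
# or keep the data as-is and override the discount check:
lmplz --order 5 --text vocab.txt --arpa lm.arpa --discount_fallback

The build_binary failure below follows from this error: lmplz aborted before writing lm.arpa, so the ARPA file is empty and reading it hits end of file at byte 0.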
…/…/new_native_client/kenlm/build/bin/build_binary lm.arpa lm.binary
Reading lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
End of file Byte: 0
I searched for this same issue, but the solution was not clear.
How can I combine my vocab and build a custom LM, sir?
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
2
Can you document your steps exactly? Are you following the data/lm/README.md documentation? I did it just yesterday, no problem at all.
No sir, I am following that README.md, but it only downloads the corpus, converts it to lower case, and builds a language model from librispeech-lm-norm.txt (4.3 GB, 4,287,216,164 bytes). If I add my own vocab.txt, it is very complex. How can I add it?
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
4
Then be clearer in describing what you do. You need to follow the steps documented in data/lm/README.md, but adjust them to add your own data. You will still need the documented data, however.
To install Eigen3 in your home directory, copy-paste this:
cd $HOME
# download and extract Eigen 3.2.8
wget -O - https://bitbucket.org/eigen/eigen/get/3.2.8.tar.bz2 | tar xj
# from the kenlm build directory, clear the stale CMake cache
rm CMakeCache.txt
# point kenlm's build at the freshly extracted Eigen
export EIGEN3_ROOT=$HOME/eigen-eigen-07105f7124f9
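The CMake output that follows suggests cmake was then re-run with EIGEN3_ROOT set; a minimal sketch of that step, using the build path from the log below:

cd /home/dell/Music/12-09-2018/DeepSpeech/new_native_client/kenlm/build
cmake ..    # regenerate build files, now picking up EIGEN3_ROOT
make -j 4   # build lmplz, build_binary, etc.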
-- Configuring done
-- Generating done
-- Build files have been written to: /home/dell/Music/12-09-2018/DeepSpeech/new_native_client/kenlm/build
I followed these instructions. Previously it was working fine (with vocab.txt from the pretrained v0.1.0 model).
That is the only reason I followed these instructions, sir.
Thank you so much for your reply, sir.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
6
Well, sorry, but at that point, I cannot do it for you. You need to combine lower.txt with your own words, and use the resulting file to produce a new language model, as documented.
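A minimal sketch of that combining step, assuming lower.txt from the README pipeline and the user's own vocab.txt; the order and output filenames are illustrative:

tr '[:upper:]' '[:lower:]' < vocab.txt > vocab_lower.txt   # lowercase to match the corpus
cat lower.txt vocab_lower.txt > combined.txt               # merge corpus and custom words
lmplz --order 5 --text combined.txt --arpa lm.arpa
build_binary trie lm.arpa lm.binary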
Sir, what can I do about the numbers present in my large vocab.txt?
My alphabet.txt contains only a-z and the apostrophe (').
How can I add numbers and some special characters to my alphabet.txt? If we add numbers to alphabet.txt, it becomes very complex. How can I handle the numbers in my large vocab.txt (15 MB)?
I am having trouble handling a large vocab.txt to build a language model. Thank you, sir.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
9
For numbers, so far, we just transform them to text, e.g. 20 = twenty.
But if we have a lot more numbers in a txt file, it is very complex. Sir, please fix this issue, because our DeepSpeech works very well, but this kind of issue reduces people's interest. Thank you, sir.
Hi @muruganrajenthirean.
I saw, on the web, a Python file to automate the process of replacing numbers with the corresponding text (as lissyx said).
Searching for you… wait.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
12
As @elpimous_robot said, search; there is Python code to do that.
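A minimal sketch of such a script, using the num2words package (an assumption here, not necessarily the file @elpimous_robot found); input and output filenames are illustrative:

import re
from num2words import num2words  # pip install num2words

def spell_out_numbers(line):
    # replace each run of digits with its spelled-out form, e.g. "20" -> "twenty"
    return re.sub(r'\d+', lambda m: num2words(int(m.group())), line)

with open('vocab.txt') as src, open('vocab_words.txt', 'w') as dst:
    for line in src:
        dst.write(spell_out_numbers(line))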
I am following the how-i-trained-a-specific-french-model tutorial, using /bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3 and /bin/bin/./build_binary -T -s words.arpa lm.binary.
I was able to generate words.arpa and lm.binary. Also, my vocabulary.txt contains all the text in UPPER CASE.
Given that I can generate the .arpa and .binary files through the lmplz and build_binary commands above (from the French tutorial) with UPPER CASE text in my vocabulary.txt, then, regarding data/lm/README.md:
a) Why convert the whole text to lowercase? (Given that I was able to generate a .binary file from the UPPER CASE text in my vocabulary.txt.)
b) Is it compulsory to convert the UPPER CASE text to lower case? Will there be any repercussions going forward if the upper case text in vocabulary.txt is not converted to lower case?
c) Also, could you please explain --memory and --prune, as used below?
!lmplz --order 5 \
  --temp_prefix /tmp/ \
  --memory 50% \
  --text {data_lower} \
  --arpa {lm_path} \
  --prune 0 0 0 1
d) Also, could you please explain -a 255 and -q 8, as used below?
!build_binary -a 255 \
  -q 8 \
  trie \
  {lm_path} \
  {binary_path}
I checked the official kheafield.com documentation but couldn't figure these out.