Issue with Language Model

I am currently using the pre-trained DeepSpeech model. When I run this command (without a LM):

deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --audio output.wav

I get the output as:
not impresecriptson crusing twice taily for three days

Using the DeepSpeech-provided LM:
deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio output.wav

I get:

not in presecriptsoncrusingtwicetailyforthreedays

I also generated an LM from a Wikipedia dump using KenLM, creating the binary with the trie data structure rather than the probing structure (https://kheafield.com/code/kenlm/structures/), since trie uses less memory and the .arpa file was some 68 GB. However, using my Wikipedia LM with the DeepSpeech acoustic model:

deepspeech --model models/output_graph.pbmm --alphabet wiki_model/alphabet.txt --lm wiki_model/lm.binary --trie wiki_model/trie --audio output.wav

I get gibberish:

/#'zyxwvutsrqponm/#'zyxwvutsrqponm/#'zyxwvutsrqponm/#'zyxwvutsrqponm/#'zyxwvutsrqboam/#'rqxwvtsr’zonm’zyxwjujseqgonm/#'z nwcutgripoem/#'zyxjvufsrqponm/#'zyxwvutsrqponm/#'zxwaujirjpond/#'qyxwvuasryxonm/#'zyxwulsmqponj/#bcyxvulsrspo’ml’zvxwvutsrwvnmo’pyxwul/#rgmotm/#'y wvwtsrngfem/#'zyxwvutsrqponm/#'zyxwvu

I don’t know why this happens.

The correct transcription is:
“Note in prescription crocin twice daily for three days”

Any help understanding why my own wiki model behaves this way would be appreciated.

What process did you use to generate the language model?

Was it similar to that described here?

It was quite similar. I created the LM using those steps, indeed.

To recap, the steps were:

  1. Clean the dataset
  2. …/kenlm/build/bin/lmplz -o 4 <complete_wiki.txt >lm.arpa
  3. …/kenlm/build/bin/build_binary -T /home/sayantan trie lm.arpa lm.binary
  4. ./DeepSpeech-0.3.0/native_client/generate_trie ./wiki_model/alphabet.txt ./wiki_model/lm.binary ./wiki_model/trie

The trie file is just about 195 MB in size, although the binary file was about 19 GB (the ARPA file is about 68 GB).

I would like to validate the trie file. Is there a utility tool for that?

Hi, @kdavis

I am copy-pasting the first ~100 lines of the lm.arpa file from which the binary file was generated:

\data\
ngram 1=9097879
ngram 2=154573300
ngram 3=594629001
ngram 4=1105510693

\1-grams:
-8.228964 <unk> 0
0 <s> -1.3960606
-1.8965266 </s> 0
-5.4005637 anarchism -0.53406227
-2.0593514 is -1.4193866
-2.4595675 a -1.1286751
-3.8110898 political -1.0689924
-4.3054976 philosophy -0.7460401
-2.8637798 that -1.0562168
-4.524627 advocates -0.6802824
-6.27761 selfgoverned -0.2877215
-4.5711317 societies -0.7481196
-3.4280145 based -0.98108906
-2.6432605 on -1.062103
-4.7492385 voluntary -0.56996393
-4.498888 cooperative -0.67753214
-4.421491 institutions -0.7604111
-4.8547673 rejecting -0.52288944
-5.402689 unjust -0.42991102
-4.7857447 hierarchy -0.6830885
-3.266868 these -1.0719532
-2.752602 are -1.2063406
-3.5015023 often -0.99296844
-3.5689614 described -0.7220998
-2.5816905 as -0.8370095
-5.6543865 stateless -0.44293898
-3.2873464 although -0.834417
-3.613686 several -0.9010781
-4.437504 authors -0.64318
-3.0845566 have -1.1816087
-4.2122087 defined -0.67623913
-3.9611819 them -0.7865658
-3.4904897 more -0.9597376
-3.9539707 specifically -0.7138097
-6.1262937 nonhierarchical -0.24206696
-2.5993662 or -0.83110285
-3.9011633 free -0.7803877
-4.4626055 associations -0.67566454
-3.853204 holds -0.93136865
-5.0178266 capitalism -0.58022237
-2.226151 the -1.1315541
-3.6163027 state -0.9375776
-2.0080698 and -1.2195712
-4.2242355 representative -0.69352305
-4.6504893 democracy -0.63682246
-2.4941974 to -1.1450372
-3.7650115 be -0.65511733
-5.306855 undesirable -0.41708732
-4.9310794 unnecessary -0.48189163
-5.0813923 harmful -0.5378032
-3.0706265 while -0.9305055
-4.2908716 opposition -0.76580906
-3.831589 central -0.7538493
-5.053821 entails -0.44765753
-4.6252303 opposing -0.50951856
-4.301204 authority -0.7473136
-5.1222243 hierarchical -0.47510737
-4.3605022 organisation -0.6749272
-2.202462 in -1.2162503
-4.5127544 conduct -0.5776099
-2.553592 of -0.9748639
-3.2563536 all -0.96405226
-3.9525073 human -0.80157
-4.0924478 relations -0.82099473
-3.6751373 usually -0.90123725
-3.8483024 considered -0.65798557
-5.871972 farleft -0.35024333
-4.6817417 ideology -0.6915243
-3.7641618 much -0.9017955
-5.0117226 anarchist -0.50772643
-4.6886826 economics -0.62269944
-4.089365 legal -0.92563426
-4.411739 reflects -0.7342739
-6.0320745 antiauthoritarian -0.25254428
-4.7506604 interpretations -0.78226703
-5.0892572 communism -0.54386073
-5.897835 collectivism -0.40263453
-5.978097 syndicalism -0.41727173
-5.893416 mutualism -0.46746132
-5.2635293 participatory -0.5440003
-3.5366023 does -1.0248679
-3.412422 not -0.8935474
-4.138037 offer -0.6300868
-4.4087763 fixed -0.6287984
-3.7950757 body -0.8366749
-4.5838757 doctrine -0.7661688
-2.724559 from -0.95169467
-3.9233131 single -0.5953478
-4.551088 particular -0.4081802
-3.730775 world -0.7725034
-4.04036 view -0.69692856
-3.5223584 instead -0.8923168
-6.6504493 fluxing -0.2506195

(Note: the forum’s markup swallows anything written inside “<>”, including the start and end of sentence tokens “s” and “/s”.)
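For reference, each entry in the “\1-grams:” section is log10(probability), token, and an optional log10(backoff); a minimal parser (my own sketch, not a KenLM utility) looks like:

```python
def parse_unigram(line):
    # One ARPA unigram entry: log10 prob, token, optional log10 backoff.
    parts = line.split()
    logprob = float(parts[0])
    token = parts[1]
    backoff = float(parts[2]) if len(parts) == 3 else 0.0
    return logprob, token, backoff

print(parse_unigram('-5.4005637 anarchism -0.53406227'))
# → (-5.4005637, 'anarchism', -0.53406227)
```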

And the first four lines of the vocabulary file:

anarchism
anarchism is a political philosophy that advocates selfgoverned societies based on voluntary cooperative institutions rejecting unjust hierarchy these institutions are often described as stateless societies although several authors have defined them more specifically as institutions based on nonhierarchical or free associations anarchism holds capitalism the state and representative democracy to be undesirable unnecessary and harmful
while opposition to the state is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations anarchism is usually considered a farleft ideology and much of anarchist economics and anarchist legal philosophy reflects antiauthoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics
anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy many types and traditions of anarchism exist not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications

Any help would be highly beneficial. Thank you.

It seems like you’re not using the options -a and -q. They are defined as follows…

-q turns quantization on and sets the number of bits (e.g. -q 8).
-a compresses pointers using an array of offsets. The parameter is the
maximum number of bits encoded by the array. Memory is minimized subject
to the maximum, so pick 255 to minimize memory.

So the created lm.binary would not be compatible with DeepSpeech. I’m surprised that no errors to that effect were logged when loading the model.
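Concretely, that would mean rebuilding with both options enabled, along these lines (a sketch; the values -q 8 and -a 255 are assumptions based on the help text above, not confirmed DeepSpeech settings):

```python
# Sketch: assemble a KenLM build_binary invocation with quantization (-q)
# and array pointer compression (-a). The flag values are assumptions.
cmd = ['build_binary', '-a', '255', '-q', '8', 'trie', 'lm.arpa', 'lm.binary']
print(' '.join(cmd))
# → build_binary -a 255 -q 8 trie lm.arpa lm.binary
```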

We load the LM as a QuantArrayTrieModel which should handle both quantization and pointer compression…

The weird characters in the output are very common in MediaWiki markup, maybe the problem is in cleaning up the Wikipedia text before feeding it into KenLM?

@reuben Thanks a lot for the reply. I used the following line to remove unexpected characters:

item = (item.translate(item.maketrans('','',string.punctuation))).lower()

Where item iterates over the lines of the text file (I had Wikipedia splits produced by WikiExtractor.py and then merged all the files).
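Put together, the cleaning pass amounts to something like this (a self-contained sketch of the one-liner above):

```python
import string

def clean_line(item):
    # Strip ASCII punctuation and lowercase, as in the one-liner above.
    return item.translate(str.maketrans('', '', string.punctuation)).lower()

print(clean_line("Note: in Prescription, Crocin!"))
# → note in prescription crocin
```

One caveat: string.punctuation covers ASCII only, so Unicode punctuation such as curly quotes or dashes survives this pass, and it also strips the apostrophe, which the DeepSpeech English alphabet does contain.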

And I also did:

def replace_numeric(text, numeric_pattern=re.compile('[0-9]+'), digit_pattern=re.compile('[0-9]'), repl='#', by_single_digit=False):
    return re.sub(numeric_pattern, repl, text) if by_single_digit else re.sub(digit_pattern, repl, text)

The above code converts all numerals into “#”.
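Note the slightly confusing parameter name: with the default by_single_digit=False every digit becomes its own “#”, while True collapses a whole run of digits into one. For example:

```python
import re

def replace_numeric(text, numeric_pattern=re.compile('[0-9]+'),
                    digit_pattern=re.compile('[0-9]'), repl='#',
                    by_single_digit=False):
    # Replace a whole digit run with one repl, or each digit separately.
    return re.sub(numeric_pattern, repl, text) if by_single_digit else re.sub(digit_pattern, repl, text)

print(replace_numeric('take 250 mg'))                        # → take ### mg
print(replace_numeric('take 250 mg', by_single_digit=True))  # → take # mg
```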

Now I see the issue with my alphabet.txt file.

In the file I included this special character:

"/#"

After removing this character, regenerating the trie, and using it, the output with a LM and without a LM is exactly the same (that’s what I spent the last 3 hours doing :P).

deepspeech --model models/output_graph.pbmm --alphabet wiki_model/alphabet.txt --lm wiki_model/lm_new.binary --trie wiki_model/trie --audio output8.wav

Output:

not impresecriptson crusing twice taily for three days

Without LM:

deepspeech --model models/output_graph.pbmm --alphabet wiki_model/alphabet.txt --audio output8.wav

Output:

not impresecriptson crusing twice taily for three days

Expected output:

note in prescription crocin twice daily for three days

" impresecriptson" and “taily” are not even words, how are they coming in the output…!!

Yeah, you definitely cannot change the alphabet file unless you’re retraining from scratch.
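Since the alphabet is fixed, one safeguard is to diff the corpus’s character set against alphabet.txt before building the LM, so stray characters can be stripped from the corpus instead. A sketch (my own; it assumes one symbol per line in alphabet.txt, with “#”-prefixed lines treated as comments):

```python
def load_alphabet(lines):
    # One symbol per line; lines starting with '#' are treated as comments
    # (an assumption about the alphabet.txt format).
    return {ln.rstrip('\n') for ln in lines
            if ln.rstrip('\n') and not ln.startswith('#')}

def unknown_chars(corpus_lines, alphabet):
    # Characters present in the corpus but not covered by the alphabet.
    seen = set()
    for ln in corpus_lines:
        seen.update(ln.rstrip('\n'))
    return seen - alphabet

alphabet = load_alphabet(['# comment', ' ', 'a', 'b', 'c'])
print(sorted(unknown_chars(['abc cab', 'a\u2013b'], alphabet)))
# → ['–']
```

Note that under this comment convention a literal “#” symbol line is awkward to express in alphabet.txt, which relates to the question about “#” below.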

It took 4 hours to read through the file, and there are a lot of weird characters in the Wikipedia dump… Now I’ll retry creating a LM after removing the hundreds of weird characters.

Still, how do I include the special character “#”, denoting numerals, in the LM?