LM + TRIE performance

Hi,

So I’m adding here more information on this topic: https://github.com/mozilla/DeepSpeech/issues/1407#issuecomment-398880456

If I use my newly trained language model with DeepSpeech’s trie file, I’m getting more accurate transcriptions than if I use my custom LM with a newly generated trie file. I don’t understand this behavior. Any thoughts?

Adding an example from this video: https://www.youtube.com/watch?v=JRAdZyW_hss (0:00 to 2:00). I have highlighted the words that differ between the two transcripts. I’m using the same DeepSpeech 0.1.1 acoustic model with different language models.

New LM + DeepSpeech TRIE: WER: 37.662% (145 / 385)

he is its manda and say i’m knowing to show you have removed a red wine stain for my carpet i able did not think it would come out and it totally did and i’m super excited um because i just had talked it up and decided that the carpet was totally ruined and i just go gold some ideas on how to get rid one now and im going to kind of explain to you the steps that i did and i really impressed with hot work a so this is the red wine stain on the carpet and you can see it s very very purple our carpet is super old and i to place if i can get a sound how can be devastated but im all i have done at this point is last night i stole the wine and i just got a wet wash cloth and kind of dab and patted at it to soak up the excess and kinda let i get wet i did not clean it last night um because unfortunately if you happen enough fund that your spilling wine on the carpet you re probably not in a cleaning up wine stain state ah so that was my situation so i’m gonna try it today my first method is going to be with peroxide and baking so i m starting with bro feta baking so dt a because i just happen have to of these on hand and it was one of the solutions i read on line so i said to spray it with hide jem fer suf pro xian did on t have an empty spray bottles on is gonna condo por it and saturate it with an the hind rd jim pro x kid ed then it said to im kino spread the baking soda over top of it and let it sit for two to three minutes and then just clean it up so thats what i m going to dry it try first and we’ll see the kind of difference it make so there s my before li tings naw the best in here hopefully i will have a better after a so here is my proxide and being so a spot treatment jus kana f sitting on n the carpet i really don’t think this is going to do anything i thought i it is or some thing when i

New LM + New TRIE: WER: 46.494% (179 / 385)

he is its manda and say i’m knowing to show you have removed a red wine stain for my carpet i able did not think it would come out and it totally did and i’m super excited um because i just had talked it up and decided that the carpet was totally ruined and i just do good some ideas on how to get rid one ow and i’m gonna kind of explain to you the steps that i did and i really impressed with h out work a so this is the red wine stain on the carpet and you can see it s very very purple our carpet ids epr old inou io places if a can k get hso u and how t can be devastated but i mall i have done at this point is last night i stole the wine and i just got a wet wash cloth and kind of dab and patted at it to soak up the excess and kinda let i get wet i did not clean it last night um because unfortunately if you happen enough fund that your spilling wine on the gar pet you re probly not in a cle a ming up wy ne staind state a hso that was my situation so i’m gonna try it today my first method is going to be with peroxide amba kings odon starting with bro feta baking so dt a because ti just happen tham to s of these on he nd and it was one of the solutions i read on line so it said to spray it with hide jem fer suf pro xian did on t have an empty spray bottles on is gonna condo por it and satu raed it with om aid to im kina spread the baking so a over top of it n let its it for two to three minutes and then just clean it up so that s what i m going to dry try first and we’ll see the kind of difference it make so there s my before w lit dings naw the best in here hopefully i will have a better after a so here is my pr oxide an nd be a king soda spot treat men jus kana f sitting on n the par pet i really don’t think this is going to do anything i thought i it d is z orr something whan i
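For reference, the WER figures above are word-level edit distance divided by reference length (e.g. 145 / 385 ≈ 37.662%). A minimal sketch of that computation (not necessarily the exact scorer used for these numbers, which may tokenize or normalize differently):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

So 145 word errors against a 385-word reference gives 145 / 385 ≈ 0.37662, matching the first figure.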

I guess it might lie in how you produced this new stuff :-). Can you share more details on your setup ?

BTW, the current view of your comparison is barely readable :slight_smile:

I’m using client.py without any modifications to the constant parameters, and the DeepSpeech 0.1.1 acoustic model. Do you need information on any specific part?

Yes, you document the “new LM” and “new TRIE”; you should also document how they were generated.

New LM: using kenlm

(1) ./lmplz
-o 3
-S 85%
-T /tmp/
--prune 0 10 20
--text /opt/data/streaming/kafka.en_only.txt
--arpa /opt/deepspeech_dir/LM/en_only/lm.arpa

lm.arpa ngrams

\data\
ngram 1=30852030
ngram 2=213357899
ngram 3=3411752
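The --prune 0 10 20 flags explain why the 3-gram count above is so much smaller than the 2-gram count: higher-order n-grams with low counts are discarded. A toy sketch of count-based pruning (KenLM’s exact threshold semantics may differ; this assumes “drop n-grams with count at or below the threshold”):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prune(counts, threshold):
    """Drop n-grams whose count does not exceed the threshold."""
    return {gram: c for gram, c in counts.items() if c > threshold}

tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(tokens, 2)
pruned = prune(bigrams, 1)  # keep only bigrams seen more than once
```

With thresholds of 0 / 10 / 20 on a large corpus, most rare 3-grams fall below the cutoff, which is consistent with the 3.4M figure above next to 213M 2-grams.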

(2) ./build_binary
/opt/deepspeech_dir/LM/en_only/lm.arpa
/opt/deepspeech_dir/LM/en_only/lm.binary

NEW TRIE: using generate_trie from DeepSpeech native_client

(1) ./generate_trie
/opt/deepspeech_dir/LM/data/en_only/alphabet.txt
/opt/deepspeech_dir/LM/en_only/lm.arpa
/opt/data/streaming/kafka.en_only.txt
/opt/deepspeech_dir/LM/lm/en_only/lm.trie

Good, what’s that kafka.en_only.txt ?

It’s cleaned English sentences used for creating the language model.

Sample

his arizona visit with a brief trip to the southern edge
a pardon for former sheriff joe arpaio asking the crowd
alone adjective adverb american english definition and synonyms macmillan dictionary
to condemn the violent bigotry of the nazi and kkk demonstrators
francisco kjolseth the salt lake tribune artist angela johnson whose ongoing
bella hadid s momma detailed this troubling moment in her soon-to-be released memoir
a former roommate of van der sloot s best friend told
how to adjust reading settings in ibooks for iphone and ipad imore
how to adjust reading settings in ibooks for iphone and ipad
north korean photos shows diagrams that suggest missiles are in development
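Sentences like these are typically normalized before LM training: lowercased, with punctuation stripped down to the acoustic model’s alphabet. A hypothetical cleanup step (not the author’s actual pipeline; the alphabet here is assumed to be a–z, space, and apostrophe, as in the samples):

```python
import re

def normalize(line):
    """Lowercase and keep only a-z, apostrophe, and spaces; collapse whitespace."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)   # replace out-of-alphabet chars
    return re.sub(r"\s+", " ", line).strip() # collapse runs of whitespace
```

Applied to raw web text, this yields sentences in the same shape as the samples above.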

I know it is the source material to produce the language model, I’m asking what is its content exactly, how you sourced it, etc.

It’s proprietary text data that mostly comes from the web: news articles, RSS feeds, and Twitter.

How much data is that ? FTR, recently @kdavis spent quite some time re-building a new language model from scratch from material that we can properly redistribute. It took a lot of steps and comparative benchmarking before finding the proper set of parameters between:

  • source data,
  • language model build parameters

One thing in your command line that strikes me is the n-gram order: you seem to use a 3-gram model, while we use 4-gram (on the new language model we added; I cannot remember for v0.1.1).

For the mentioned LM, I’m using 8 GB of data with min_sentence_length = 10.
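A minimal sketch of that kind of length filter (min_sentence_length is assumed here to count words; in the author’s actual pipeline it could equally count characters):

```python
MIN_SENTENCE_LENGTH = 10  # threshold from the thread; assumed to be a word count

def filter_sentences(lines, min_len=MIN_SENTENCE_LENGTH):
    """Yield only sentences that have at least min_len words."""
    for line in lines:
        if len(line.split()) >= min_len:
            yield line.strip()
```

Running such a filter over the raw corpus before lmplz keeps short fragments out of the n-gram counts.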

Could you give the current v0.2.0a6 binaries a try, along with data/lm/lm.binary + data/lm/trie from the Git repo, using the 0.1.1 acoustic model ? This way we would have a rather sane base of comparison with a dataset you can reproduce.

Sure, I will try the v0.2.0a6 and update this thread. Thanks for the replies :slightly_smiling_face:

Another thing: since you compare WER depending on the LM + TRIE you are using, it’d be great if you also documented the values you get with the default v0.1.1 release, so we have this list:

  • v0.1.1 LM + v0.1.1 TRIE
  • NEW LM + v0.1.1 TRIE
  • NEW LM + NEW TRIE

The new language model and trie file we produced should have been built out of transcripts from the LibriVox dataset.