Changing alphabet.txt for the Language Model


(SGang) #1

I’m using the pre-trained DeepSpeech v0.3 model for my use case. I’m trying to use a Wikipedia dump as the language model. Hence, I’m doing the following:

  1. …/kenlm/build/bin/lmplz -o 4 -T /home/sayantan < wiki_dump.txt > lm_new.arpa
  2. …/kenlm/build/bin/build_binary trie -T /home/sayantan lm_new.arpa lm_new.binary
  3. ./DeepSpeech/native_client/generate_trie ./wiki_model/alphabet_new.txt ./wiki_model/lm_new.binary ./wiki_model/trie_new

What I did is change the alphabet.txt file a bit. Does that cause a problem in decoding? (It probably does.)

And I’m getting gibberish output when using the LM, but reasonable (though not perfect) output without the LM.

Can anyone confirm whether the “alphabet” file is the issue, and whether something is missing in the steps?

Thanks.


(kdavis) #2

It does cause a problem. The output nodes of the model correspond to the letters of alphabet.txt, so by changing alphabet.txt you’re essentially randomly shuffling the alphabet.
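
To make that concrete, here’s a minimal sketch (the function names are illustrative, not DeepSpeech’s actual API) of how a decoder maps output node indices to characters via alphabet.txt:

    def load_alphabet(path):
        # simplified reader: real alphabet.txt files may also contain '#' comment lines
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if not line.startswith("#")]

    def decode(node_indices, alphabet):
        # node_indices: per-timestep argmax over the model's output nodes,
        # assumed already CTC-collapsed (repeats merged, blanks removed)
        return "".join(alphabet[i] for i in node_indices)

    # With the default alphabet (space on line 0, then a-z), [8, 5, 12, 12, 15]
    # decodes to "hello"; reorder or remove lines and the same indices
    # decode to gibberish, which is exactly what changing alphabet.txt does.
    alphabet = load_alphabet("alphabet.txt")
    print(decode([8, 5, 12, 12, 15], alphabet))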

What’s likely a better idea is to remove from the wiki dump all characters not in alphabet.txt, i.e. lowercase everything and remove all “:”, “;”, etc., then create a language model from the cleaned wiki text and use the default acoustic model.
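
For example, something along these lines (a rough sketch; the filenames are placeholders) keeps the dump within the default alphabet:

    import re

    # Lowercase the dump and replace every character outside the default
    # alphabet (a-z, apostrophe, space) with a space, then squeeze runs of
    # spaces so the LM doesn't learn empty tokens.
    OUTSIDE_ALPHABET = re.compile(r"[^a-z' ]+")

    with open("wiki_dump.txt", encoding="utf-8") as src, \
         open("wiki_dump_clean.txt", "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = OUTSIDE_ALPHABET.sub(" ", line.lower())
            cleaned = re.sub(r" {2,}", " ", cleaned).strip()
            if cleaned:
                dst.write(cleaned + "\n")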


(SGang) #3

I’ve kept only lowercase letters and whitespace. Initially I tried to add “#” as a special character, but in the end even “#” was removed, so the whole dump contains only the 26 letters plus whitespace and newlines.
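
As a sanity check, something like this (the filename is an assumption) lists every character that survives in the processed dump, so the set can be compared against the labels in alphabet.txt:

    # Stream through the cleaned dump and collect the distinct characters.
    chars = set()
    with open("wiki_dump_clean.txt", encoding="utf-8") as f:
        for line in f:
            chars.update(line)
    print(sorted(chars))  # expect only a-z, space, and newline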

The original alphabet.txt file has an apostrophe (') as one of its entries. I ended up removing it as it’s not present in my dump.

I shall do as you suggest and keep the single quote in alphabet.txt (even though it’s not present in the processed dump) and retry.

Another question: the generated trie file is around 180 MB, from an 18 GB binary file. Is that right? (No error is thrown.) Or should the trie file be bigger?