Changing alphabet.txt for the Language Model


(SGang) #1

I’m using the pre-trained DeepSpeech v0.3 model for my use case. I’m trying to use a Wikipedia dump as the language model. Hence, I’m doing the following:

  1. …/kenlm/build/bin/lmplz -o 4 -T /home/sayantan < wiki_dump.txt > lm_new.arpa
  2. …/kenlm/build/bin/build_binary trie -T /home/sayantan lm_new.arpa lm_new.binary
  3. ./DeepSpeech/native_client/generate_trie ./wiki_model/alphabet_new.txt ./wiki_model/lm_new.binary ./wiki_model/trie_new

What I did is change the alphabet.txt file a bit. Does that cause a problem in decoding? (It probably does.)

And I’m getting gibberish output when using the LM, but reasonable (though not perfect) output without the LM.

Can anyone confirm whether the “alphabet” file is the issue, and whether something is missing in the steps?

Thanks.


(kdavis) #2

It does cause a problem. The output nodes of the model correspond to the letters of alphabet.txt, so by changing alphabet.txt you’re essentially randomly shuffling the alphabet.
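
To make that concrete, here’s a minimal sketch (the function names are illustrative, not DeepSpeech’s actual API) of how a decoder maps output node indices to characters via alphabet.txt:

    def load_alphabet(path):
        # simplified reader: real alphabet.txt files may also contain '#' comment lines
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if not line.startswith("#")]

    def decode(node_indices, alphabet):
        # node_indices: per-timestep argmax over the model's output nodes,
        # assumed already CTC-collapsed (repeats merged, blanks removed)
        return "".join(alphabet[i] for i in node_indices)

    # With the default alphabet (space on line 0, then a-z), [8, 5, 12, 12, 15]
    # decodes to "hello"; reorder or remove lines and the same indices
    # decode to gibberish, which is exactly what changing alphabet.txt does.
    alphabet = load_alphabet("alphabet.txt")
    print(decode([8, 5, 12, 12, 15], alphabet))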

What’s likely a better idea is to remove from the wiki dump all characters not in alphabet.txt, i.e. lowercase everything and remove all “:”, “;”, etc., then create a language model from the cleaned wiki text and use the default acoustic model.
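
For example, something along these lines (a rough sketch; the filenames are placeholders) keeps the dump within the default alphabet:

    import re

    # Lowercase the dump and replace every character outside the default
    # alphabet (a-z, apostrophe, space) with a space, then squeeze runs of
    # spaces so the LM doesn't learn empty tokens.
    OUTSIDE_ALPHABET = re.compile(r"[^a-z' ]+")

    with open("wiki_dump.txt", encoding="utf-8") as src, \
         open("wiki_dump_clean.txt", "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = OUTSIDE_ALPHABET.sub(" ", line.lower())
            cleaned = re.sub(r" {2,}", " ", cleaned).strip()
            if cleaned:
                dst.write(cleaned + "\n")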


(SGang) #3

I’ve kept only lowercase letters and whitespace. Initially I tried to add “#” as a special character, but in the end even “#” was removed, so the whole dump contains only the 26 letters plus whitespace and newlines.
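
As a sanity check, something like this (the filename is an assumption) lists every character that survives in the processed dump, so the set can be compared against the labels in alphabet.txt:

    # Stream through the cleaned dump and collect the distinct characters.
    chars = set()
    with open("wiki_dump_clean.txt", encoding="utf-8") as f:
        for line in f:
            chars.update(line)
    print(sorted(chars))  # expect only a-z, space, and newline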

The original alphabet.txt file has an apostrophe (') as one of its entries. I ended up removing it as it’s not present in my dump.

I shall do as you suggest and keep the single quote in alphabet.txt (even though it’s not present in the processed dump) and retry.

Another question: the generated trie file is around 180 MB, from an 18 GB binary file. Is that right? (No error is thrown.) Or should the trie file be bigger?