It does cause a problem: the output nodes of the model correspond to the letters of alphabet.txt, so you’re essentially randomly shuffling the alphabet when you change alphabet.txt.
What’s likely a better idea is to remove all characters not in alphabet.txt from the wiki dump, i.e. lowercase everything and remove all “:”, “;”…, then create a language model from the new wiki text and use the default acoustic model.
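A rough sketch of that cleanup step in Python, in case it helps. The alphabet here (space, a-z and the straight apostrophe) is an assumption, so load the real characters from your alphabet.txt instead; `normalize_line` is just a hypothetical helper name.

```python
import re

# Assumed alphabet: space, a-z and the straight apostrophe.
# This is a placeholder; read the real set from alphabet.txt.
ALPHABET = set(" abcdefghijklmnopqrstuvwxyz'")

def normalize_line(line, alphabet=ALPHABET):
    """Lowercase a line and replace every character outside the alphabet with a space."""
    lowered = line.lower()
    kept = "".join(ch if ch in alphabet else " " for ch in lowered)
    # Collapse the runs of spaces left behind by the replaced characters.
    return re.sub(r" +", " ", kept).strip()

if __name__ == "__main__":
    print(normalize_line("Hello, World: it's 2019;"))
```

Replacing unwanted characters with a space rather than deleting them keeps words such as “foo:bar” from being glued together; whether that is what you want depends on your text.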
I’ve kept only lowercase letters and whitespace, and initially tried to add “#” as a special character. In the end, though, even “#” has been removed, so the whole dump now contains only the 26 letters plus whitespace and newline.
The original alphabet.txt file has:
"’ ’ "
I ended up removing this from alphabet.txt, as it’s not present in the dump.
I shall do as you suggest and keep the single quote in alphabet.txt (although it’s not present in the processed dump) and retry.
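Before retrying, one way to double-check that the processed dump and alphabet.txt agree is to count any characters in the dump that the alphabet does not cover. This is only a minimal sketch: the function name is hypothetical, and the alphabet is assumed to be already loaded as a set of characters.

```python
from collections import Counter

def chars_outside_alphabet(lines, alphabet):
    """Count characters in the given lines that are not in the alphabet (trailing newlines ignored)."""
    counts = Counter()
    for line in lines:
        for ch in line.rstrip("\n"):
            if ch not in alphabet:
                counts[ch] += 1
    return counts

if __name__ == "__main__":
    # Assumed alphabet for illustration only; use the contents of alphabet.txt.
    alphabet = set(" abcdefghijklmnopqrstuvwxyz'")
    sample = ["abc#\n", "d e;"]
    print(chars_outside_alphabet(sample, alphabet))
```

An empty result means every character in the dump is covered by the alphabet. Characters that are in the alphabet but never appear in the dump (like the single quote here) do not show up in this count.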
Another question: the trie file generated is around 180 MB, from a binary file of 18 GB. Is that right? (No error is being thrown.) Or should the trie file be bigger?