I have trained an Arabic model and managed to get a WER of around 0.35. My dataset is still small (~40 hours), and I'm working on collecting more data and on data augmentation.
Meanwhile, I see some strange output text:
- Two correct words run together with no space between them
- Very weird letters that look like nothing, just rubbish!
As far as I understand, the language model is used during beam search and shapes the output text. Why is the output not restricted to the vocabulary of the LM? Is there a switch for that?
+1, I was in the same boat, but I was training on a Chinese dataset. What resolved it for me (it might not be applicable to your language) was separating each character with a space and building the language model with 4-grams. Still, it is strange to have no spaces between words, given that the space character is in alphabet.txt and, of course, in the vocabulary used to build the language model.
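For reference, here is a minimal sketch of that preprocessing, assuming a plain-text corpus with one sentence per line; the file names are hypothetical, and the resulting character-level corpus would then be fed to KenLM's lmplz to build the 4-gram model.

```python
# Minimal sketch: rewrite a corpus so that every character becomes its
# own "word", allowing KenLM to build a character-level 4-gram model.
# File names here are hypothetical.

def to_char_tokens(line: str) -> str:
    # Drop existing whitespace (no word boundaries in Chinese anyway),
    # then put one space between the remaining characters.
    chars = [c for c in line.strip() if not c.isspace()]
    return " ".join(chars)

with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus.chars.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(to_char_tokens(line) + "\n")

# The character-level corpus can then be given to KenLM, e.g.:
#   lmplz -o 4 < corpus.chars.txt > lm.arpa
```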
The trie determines what is "in vocabulary", and valid_word_count_weight determines how much importance is given to the trie's opinion on what is and is not in vocabulary.
So, in particular, increasing the value of valid_word_count_weight should decrease the occurrence of out-of-vocabulary words as defined by the trie.
The weights I mentioned, lm_weight, word_count_weight, and valid_word_count_weight, are external to the language model; they are not part of the language model's own weights, which, as you mention, are built automatically.
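To make the relationship concrete, here is an illustrative-only sketch of how such external weights could combine when rescoring a beam hypothesis. This is not the decoder's actual implementation; the function and parameter names are hypothetical.

```python
# Illustrative sketch only: one plausible way the external decoder
# weights could combine into a single beam score. Not the real code.

def beam_score(acoustic_logprob: float,
               lm_logprob: float,
               word_count: int,
               valid_word_count: int,
               lm_weight: float,
               word_count_weight: float,
               valid_word_count_weight: float) -> float:
    """Score a beam hypothesis: acoustic evidence, plus the LM's
    opinion, plus bonuses per word and per trie-validated word."""
    return (acoustic_logprob
            + lm_weight * lm_logprob
            + word_count_weight * word_count
            + valid_word_count_weight * valid_word_count)
```

Under this kind of scoring, raising valid_word_count_weight directly increases the bonus for hypotheses whose words the trie recognizes, which is why it suppresses out-of-vocabulary output.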