How to restrict the output to the Language Model only?

Hello Community.

I have trained an Arabic model and managed to get a WER of around 0.35. My dataset is still small (~40 hours), and I’m working on collecting more data and on data augmentation.

Meanwhile, I see some strange output text.

  1. Two correct words run together with no space between them
  2. Very weird letters that look like nothing, rubbish!

As far as I understand, the language model is used for beam search and defines the output text. Why is the output not restricted to vocabulary from the LM? Is there a switch for that?

+1, I was in the same boat, but I was training on a Chinese dataset. What resolved it for me (it might not be applicable to your language) was to separate each character with a space and build the language model as a 4-gram. Still, it's strange not to get a space between words, despite having the space character in alphabet.txt and, of course, in the vocabulary used to build the language model.
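For anyone who wants to try the character-level workaround described above, here is a minimal sketch of the preprocessing step. The function names `space_separate` and `prepare_corpus` are made up for illustration and are not part of DeepSpeech or KenLM; the sketch only assumes the transcripts are available as plain text lines. The resulting lines would then go to whatever tool you use to build the 4-gram model.

```python
def space_separate(line: str) -> str:
    """Return the line with every non-whitespace character separated by one space."""
    # Existing whitespace is dropped first so the result has exactly one
    # space between consecutive characters.
    return " ".join(ch for ch in line if not ch.isspace())


def prepare_corpus(lines):
    """Yield space-separated versions of the non-empty transcript lines."""
    for line in lines:
        separated = space_separate(line)
        if separated:  # skip lines that were empty or whitespace only
            yield separated
```

With this, a transcript line such as `你好 世界` becomes `你 好 世 界`, so every character is treated as its own "word" by the language model. Whether that helps for Arabic is a separate question; it was only reported above for Chinese.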

@tarekeldeeb

In particular, the trie indicates which words are valid, and valid_word_count_weight indicates the relative weight given to the trie results.

@kdavis

  • Yes, I created my language model and made sure that all spoken words are included.
  • Yes, I created my trie.
  • No, I did not adjust any weights; I think these are built and included automatically when building a 4-gram.

I can see clear text in the .arpa file, as expected. In the .trie file I see a pattern like:

-1
21
2633
-4.63527
-1
-1
-1
-1
-1
-1
-1
-1
-1
1
14574
-4.63527
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1

What have I missed?

I think you are just hitting https://github.com/mozilla/DeepSpeech/issues/1156

The trie determines what is “in vocabulary” and the valid_word_count_weight determines how much importance should be given to the trie’s opinion on what is in and what is not in vocabulary.

So, in particular, increasing the value of valid_word_count_weight should decrease the occurrence of out of vocabulary words as defined by the trie.

The weights I mentioned (lm_weight, word_count_weight, and valid_word_count_weight) are external to the language model; they are not part of the language model weights, which, as you mention, are built automatically.
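To make the role of the three weights more concrete, here is a toy sketch. It is not DeepSpeech's decoder code: it assumes the weights enter the beam score as a simple linear combination, and the `Hypothesis` class, the helper functions, and all numbers are invented for illustration. The vocabulary set stands in for the trie.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_logp: float  # log-probability from the acoustic model (made up)
    lm_logp: float        # log-probability from the n-gram LM (made up)


def count_words(text, vocabulary):
    """Return (total words, words found in the vocabulary)."""
    words = text.split()
    valid = sum(1 for w in words if w in vocabulary)
    return len(words), valid


def combined_score(hyp, vocabulary, lm_weight, word_count_weight,
                   valid_word_count_weight):
    """Assumed linear combination of acoustic score, LM score and word bonuses."""
    n_words, n_valid = count_words(hyp.text, vocabulary)
    return (hyp.acoustic_logp
            + lm_weight * hyp.lm_logp
            + word_count_weight * n_words
            + valid_word_count_weight * n_valid)


def best(hyps, vocabulary, **weights):
    """Return the text of the highest-scoring hypothesis."""
    return max(hyps, key=lambda h: combined_score(h, vocabulary, **weights)).text


# Toy data: the run-together hypothesis has the better raw scores.
VOCAB = {"hello", "world"}
HYPS = [
    Hypothesis("helloworld", acoustic_logp=-4.0, lm_logp=-6.0),
    Hypothesis("hello world", acoustic_logp=-5.0, lm_logp=-6.5),
]

for vwcw in (0.0, 1.0):
    winner = best(HYPS, VOCAB, lm_weight=1.0, word_count_weight=0.0,
                  valid_word_count_weight=vwcw)
    print(f"valid_word_count_weight={vwcw}: {winner}")
```

With these toy numbers, the out-of-vocabulary hypothesis wins when valid_word_count_weight is 0, and the in-vocabulary one wins once the weight is raised to 1.0, because each valid word adds a bonus. The real values that work for a given model have to be found by experiment.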

I hope that’s a bit clearer?

Yes, thanks a lot.

I have started playing around with those weights.

Regards,