Adding an unknown word tag when training a model


In my training data I have a few samples where the speaker code-switches to a word in another language that I don’t want to train on. Can I add a tag, such as <UNK>, so that the word will be ignored when the model is trained?

./audio/sample1.wav, 5000, I went to the restaurant and <UNK> we ordered off the menu

I’m not sure this is possible, since the CTC loss function works at the character level. If not, I’ll just create two training samples around that word, or remove those samples entirely. Any help is appreciated!
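To illustrate the fallback mentioned above, here is a minimal sketch of splitting a transcript around the tag so the foreign word never appears in the training text. The function name and the `<UNK>` tag are my own conventions; note that splitting the matching audio would additionally require word-level timestamps, which is not shown here.

```python
# Hypothetical helper: split a transcript at an <UNK> token so the
# foreign word is dropped, yielding two clean sub-transcripts.
# (The corresponding audio would still need to be cut at matching
# timestamps, which this text-only sketch does not handle.)
def split_around_unk(transcript, tag="<UNK>"):
    """Return the transcript pieces on either side of `tag`."""
    words = transcript.split()
    if tag not in words:
        return [transcript]
    i = words.index(tag)
    left = " ".join(words[:i])
    right = " ".join(words[i + 1:])
    return [piece for piece in (left, right) if piece]

pieces = split_around_unk(
    "I went to the restaurant and <UNK> we ordered off the menu"
)
# → ["I went to the restaurant and", "we ordered off the menu"]
```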

(kdavis) #2

Does the word in the foreign language use the same character set as English? If so I’d just leave the word in.

As you say, CTC works at the character level, so it will try to spell out any foreign word the way an English speaker would.
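The character-level behaviour can be made concrete with the standard CTC collapsing rule: the network emits one character (or a blank) per time step, and decoding merges repeated characters and drops blanks. This is a generic illustration, not DeepSpeech’s actual decoder code.

```python
# Minimal sketch of the CTC collapsing rule: merge consecutive repeats,
# then drop the blank symbol ("-" here). Because decoding is built from
# per-frame character emissions, a foreign word simply gets spelled out
# character by character, as an English speaker might transcribe it.
def ctc_collapse(frames, blank="-"):
    out = []
    prev = None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("hh-e-ll-lo-"))  # → "hello"
```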

Also, the trie, which stores “known words”, and its relative weight valid_word_count_weight can be used to tune the model a bit in this regard, guiding it to preferentially generate (or avoid) “known words”.
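As a rough sketch of where that weight plugs in: in older DeepSpeech releases the language-model artifacts and decoder weights are passed as flags to the training/inference script. Flag names and defaults vary between versions, so treat the exact paths and values below as placeholders.

```shell
# Hedged sketch (flag names differ across DeepSpeech versions):
# point the decoder at the LM binary and trie, and adjust how strongly
# it prefers "known words" via valid_word_count_weight.
python DeepSpeech.py \
  --checkpoint_dir ./checkpoints \
  --lm_binary_path ./lm.binary \
  --lm_trie_path ./trie \
  --valid_word_count_weight 1.00
```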


Thanks for the reply! The foreign language’s character set and English’s overlap by about half. I took a look at the pretrained model’s trie file. Could I find the foreign-language words in it and zero out their values?