Adding an unknown word tag when training a model


In my training data I have a few samples where the speaker code-switches to a word in another language that I don’t want to train on. Can I add a tag, such as <UNK>, so that the word will be ignored when the model is trained?

./audio/sample1.wav, 5000, I went to the restaurant and <UNK> we ordered off the menu

I’m not sure this is possible, since the CTC loss function works at the character level. If not, I’ll just create two training samples around that word, or remove those samples entirely. Any help is appreciated!
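To illustrate the fallback mentioned above, here is a minimal sketch of splitting a transcript around the tag so the foreign word never appears in the training text. The function name and the `<UNK>` tag are my own conventions; note that splitting the matching audio would additionally require word-level timestamps, which is not shown here.

```python
# Hypothetical helper: split a transcript at an <UNK> token so the
# foreign word is dropped, yielding two clean sub-transcripts.
# (The corresponding audio would still need to be cut at matching
# timestamps, which this text-only sketch does not handle.)
def split_around_unk(transcript, tag="<UNK>"):
    """Return the transcript pieces on either side of `tag`."""
    words = transcript.split()
    if tag not in words:
        return [transcript]
    i = words.index(tag)
    left = " ".join(words[:i])
    right = " ".join(words[i + 1:])
    return [piece for piece in (left, right) if piece]

pieces = split_around_unk(
    "I went to the restaurant and <UNK> we ordered off the menu"
)
# → ["I went to the restaurant and", "we ordered off the menu"]
```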

(kdavis) #2

Does the word in the foreign language use the same character set as English? If so I’d just leave the word in.

As you say, CTC works at the character level, so it will try to spell out any foreign word the way an English speaker would.
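The character-level behaviour can be made concrete with the standard CTC collapsing rule: the network emits one character (or a blank) per time step, and decoding merges repeated characters and drops blanks. This is a generic illustration, not DeepSpeech’s actual decoder code.

```python
# Minimal sketch of the CTC collapsing rule: merge consecutive repeats,
# then drop the blank symbol ("-" here). Because decoding is built from
# per-frame character emissions, a foreign word simply gets spelled out
# character by character, as an English speaker might transcribe it.
def ctc_collapse(frames, blank="-"):
    out = []
    prev = None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("hh-e-ll-lo-"))  # → "hello"
```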

Also, the trie, which stores “known words”, and its relative weight valid_word_count_weight can be used to tune the model a bit in this regard, guiding it to preferentially generate (or avoid) “known words”.
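As a rough sketch of where that weight plugs in: in older DeepSpeech releases the language-model artifacts and decoder weights are passed as flags to the training/inference script. Flag names and defaults vary between versions, so treat the exact paths and values below as placeholders.

```shell
# Hedged sketch (flag names differ across DeepSpeech versions):
# point the decoder at the LM binary and trie, and adjust how strongly
# it prefers "known words" via valid_word_count_weight.
python DeepSpeech.py \
  --checkpoint_dir ./checkpoints \
  --lm_binary_path ./lm.binary \
  --lm_trie_path ./trie \
  --valid_word_count_weight 1.00
```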


Thanks for the reply! The foreign language’s character set and English’s overlap by about half. I took a look at the pretrained model’s trie file. Could I find the foreign-language words in it and zero out their values?