Does it matter if training data comes from various types of sources while the LM covers one business vertical?

Kunal.botify · March 11, 2020, 2:42pm

Hi,

I am building an STT system for a company in the education sector. We have around 700 hours of corrected training data from the education sector, which is very relevant to the type of conversations that will take place.
To increase my vocabulary, can I add a corrected set from YouTube, which can be generic, and 2-3 from business verticals like finance and medical?
The total will be around 2000 hours. I intend to keep the LM from the education vertical only.
Will this degrade the inference performance of my model or enhance it?
Is there any other approach so that I can create one model and, in the future, just keep changing the LM for different business verticals?

Please advise or point me to any resource that would be helpful.

Thanks in advance

victornoriega7 · March 11, 2020, 3:08pm

I think your performance will improve, since your acoustic model will get better with more data. You can also use the generic data that is freely available on the net, like LibriSpeech, LibriVox, etc…

If you change your environment to one more related to business, you should keep all the data you can. If your acoustic model is successful, I’d advise you to just change your LM.
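
As a rough illustration of pooling data from several sources, here is a minimal sketch that merges per-source clip lists into one shuffled training list and reports each source's share. The file names, the `(audio_path, transcript)` layout, and the function names `pool_sources` and `source_shares` are all hypothetical; the 7-of-20 split only mirrors the 700-of-2000-hours ratio mentioned above.

```python
import random
from collections import Counter


def pool_sources(sources, seed=0):
    """Merge per-source clip lists into one shuffled training list.

    `sources` maps a source name to a list of (audio_path, transcript) pairs.
    Each pooled entry keeps its source name so the mix can be inspected later.
    """
    pooled = [
        (name, path, text)
        for name, clips in sources.items()
        for path, text in clips
    ]
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(pooled)
    return pooled


def source_shares(pooled):
    """Return the fraction of pooled clips contributed by each source."""
    counts = Counter(name for name, _, _ in pooled)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Hypothetical clip lists; real data would come from your own manifests.
sources = {
    "education": [(f"edu_{i}.wav", "placeholder transcript") for i in range(7)],
    "youtube": [(f"yt_{i}.wav", "placeholder transcript") for i in range(7)],
    "finance": [(f"fin_{i}.wav", "placeholder transcript") for i in range(3)],
    "medical": [(f"med_{i}.wav", "placeholder transcript") for i in range(3)],
}

pooled = pool_sources(sources)
shares = source_shares(pooled)
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.0%}")
```

Checking the shares before training is a cheap way to see how much of the mix still comes from the vertical you actually care about.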

Kunal.botify · March 11, 2020, 3:45pm

Hi @victornoriega7,

Thanks for your reply, I totally buy your argument. But won’t the acoustic model get confused, considering the similarity between different types of words that come up when the business changes…

Or should we create different models, each with a larger proportion of data from the business vertical it will cater to, a smaller proportion from the others, and the LM of that vertical?

@reuben @lissyx, any advice here…

Thanks
Kunal

victornoriega7 · March 11, 2020, 4:04pm

No, the acoustic model won’t get confused by words, since it is not learning words, but utterances or letters. The Language Model (LM) is the one learning words, and that can get confused in some sentences.
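
To make the split between the two components concrete, here is a toy sketch, not how any particular STT toolkit implements decoding. It assumes the acoustic model has produced two equally likely letter sequences, and that each vertical has its own tiny unigram LM; the candidate strings, the probabilities, and the names `lm_log_prob` and `best_transcript` are invented for illustration.

```python
import math

# Hypothetical acoustic-model output: candidate letter sequences with
# log-probabilities. The two candidates sound alike, so they tie here.
acoustic_candidates = {
    "cell division": math.log(0.5),
    "sell division": math.log(0.5),
}

# Hypothetical per-vertical unigram LMs (word -> probability).
education_lm = {"cell": 0.01, "sell": 0.0001, "division": 0.01}
finance_lm = {"cell": 0.0001, "sell": 0.01, "division": 0.01}


def lm_log_prob(sentence, lm, unk=1e-6):
    """Sum unigram log-probabilities; unseen words get a small floor."""
    return sum(math.log(lm.get(word, unk)) for word in sentence.split())


def best_transcript(candidates, lm, alpha=1.0):
    """Pick the candidate with the best acoustic score plus weighted LM score."""
    return max(
        candidates,
        key=lambda s: candidates[s] + alpha * lm_log_prob(s, lm),
    )


print(best_transcript(acoustic_candidates, education_lm))
print(best_transcript(acoustic_candidates, finance_lm))
```

The acoustic scores are identical in both calls; only the LM changes, and with it the word that wins. That is the sense in which one acoustic model can be reused across verticals while the LM is swapped.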