I have collected some data from YouTube containing technical words, and I can collect more if this works. I want the DeepSpeech model to recognize these words too. I started with fine-tuning, but that did not give promising results. I then trained from scratch, but the results were not promising either. I am stuck on how to make the model learn these proper nouns. Please help me with how to proceed. Thanks in advance.
DeepSpeech version: 0.7.4
GPU: Tesla P40
YouTube data size: 5 hours (approximately)
5 hours is not enough to learn a language with special words; it might be enough for just a couple of commands.
Search for fine-tuning and transfer learning here to see what parameters others used. You didn't give us any info about what you did, so it is hard to give advice.
Use a custom scorer. If the new words don't appear often in the scorer, it can't find them.
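For reference, building a custom scorer in the 0.7.x era roughly follows the steps below. This is a sketch, not an exact recipe: the file names (`scorer_data.txt`, `alphabet.txt`, `custom.scorer`), the KenLM path, and the alpha/beta values are placeholders, and the exact script names and flags vary between DeepSpeech versions, so check `data/lm/` and the external-scorer docs in your own checkout before running.

```shell
# 1. Build a pruned KenLM language model from your text corpus
#    (generate_lm.py ships in data/lm/ of the DeepSpeech repo).
python3 data/lm/generate_lm.py \
  --input_txt scorer_data.txt \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie

# 2. Package the LM and vocabulary into a .scorer file
#    (the packaging tool's name and flags differ per release; verify yours).
generate_scorer_package \
  --alphabet alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package custom.scorer \
  --default_alpha 0.93 \
  --default_beta 1.18
```

Then pass `--scorer custom.scorer` (or `--scorer_path`, depending on the script) when evaluating or doing inference.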
Thanks a lot, sir. I tried the custom scorer just today, but the scorer file came out at only 1.5 MB, whereas kenlm.scorer is more than 900 MB. I used the following commands to get it. It also performs a little worse than the existing scorer file. What could be the possible reason? Please help.
How much data is in your scorer_data.txt? A normal scorer would contain all of Wikipedia in your language, plus any other textual data you can find or that should be recognized by the scorer.
It always depends on your use case, but Wikipedia easily gives you a couple million. So, for general language understanding this is not much. For special use cases this amount could be OK.
Ideally, the scorer contains many different sentences with the technical words you want to recognize. It looks for the most probable combination of words in the scorer for the letters it finds in the audio.
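To illustrate the idea (this is a toy bigram model, not DeepSpeech's actual KenLM scorer): word sequences that appear in the scorer's training text score higher, so the decoder prefers them among acoustically similar candidates. The corpus, the technical term "kubernetes", and the misheard variant are all made up for the example.

```python
import math
from collections import Counter

def bigram_scorer(corpus):
    """Build an add-one-smoothed bigram log-probability scorer from sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    vocab = len(unigrams) + 1

    def score(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
            for a, b in zip(words[:-1], words[1:])
        )

    return score

# Toy scorer corpus: the technical term appears in several sentences.
score = bigram_scorer([
    "deploy the app on kubernetes",
    "kubernetes manages the cluster",
    "scale the cluster with kubernetes",
])

# A word sequence seen in the corpus outscores an unseen, acoustically
# similar one, so the decoder would pick the technical term.
seen = score("deploy the app on kubernetes")
unseen = score("deploy the app on cooper nets")
```

This is why a handful of sentences per technical term is not enough: the term needs to appear in many different contexts for the scorer to prefer it reliably.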
If you want to identify just 2 words in different combinations, the scorer can be 5 B.
If you want to recognize 1000 technical terms in different sentences, you should have 500MB of raw text input.
These are not exact numbers; they are just rough estimates.
Thanks a lot, @othiele. It helped a lot and made the performance way better. I built the custom scorer, tuned it with lm_optimizer.py, and the WER instantly decreased by 10%.
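For anyone else following this: the lm_optimizer.py run looked roughly like the sketch below. The flag names and file paths here are assumptions from my setup, so check `python3 lm_optimizer.py --help` in your DeepSpeech checkout before running.

```shell
# Search for the best alpha/beta decoder weights against a dev set
# (lm_optimizer.py ships in the root of the DeepSpeech repo).
python3 lm_optimizer.py \
  --test_files dev.csv \
  --checkpoint_dir checkpoints/ \
  --scorer_path custom.scorer
```

The best alpha/beta it reports can then be baked into the scorer package as the default values, or passed explicitly at inference time.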