Learning new words for STT

I have collected some data from YouTube containing technical words (I can collect more if this works), and I want the DeepSpeech models to recognize them too. I started with fine-tuning, but that did not give any promising results. I also trained from scratch, but the results are not promising in that case either. I am stuck on how to make the model learn more proper nouns. Please help me on how to proceed with this. Thanks in advance.

DeepSpeech version: 0.7.4
GPU: Tesla P40
YouTube data size: 5 hours (approximately)

  1. 5 hours is not enough to learn a language with special words. Maybe for just a couple of commands.

  2. Search for fine-tuning and transfer learning here to see what parameters others used. You didn’t give us any info on what you did, so it is hard to give advice (see the sketch below this list for a typical starting point).

  3. Use a custom scorer. If the new words don’t appear in the scorer often, it can’t find them.
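For reference, a minimal sketch of what a fine-tuning run on 0.7.4 could look like. The paths, batch sizes, epoch count and learning rate below are placeholders, not values from this thread; the idea is just to continue training from the released 0.7.4 checkpoint on your own CSVs with a small learning rate instead of starting from scratch.

# Placeholder paths and hyperparameters - adjust to your own data and hardware.
python ./DeepSpeech-0.7.4/DeepSpeech.py \
  --checkpoint_dir ./deepspeech-0.7.4-checkpoint/ \
  --train_files ./combined_scratch/train.csv \
  --dev_files ./combined_scratch/dev.csv \
  --test_files ./combined_scratch/test.csv \
  --epochs 3 \
  --learning_rate 0.00001 \
  --dropout_rate 0.15 \
  --train_batch_size 32 \
  --dev_batch_size 32 \
  --test_batch_size 32 \
  --scorer_path ./combined_scratch/kenlm.scorer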

Thanks a lot, sir. I tried with the custom scorer just today, but the scorer file size came out at 1.5 MB, whereas the release kenlm.scorer is more than 900 MB. I used the following commands to get it, and it also performs a little worse compared to the existing scorer file. What can be the possible reason? Please help.

python ./DeepSpeech-0.7.4/data/lm/generate_lm.py --input_txt ./combined_scratch/scorer_data.txt --output_dir ./combined_scratch/ --top_k 1000 --kenlm_bins ./scorer_stuffs/kenlm_master/build/bin/ --arpa_order 4 --max_arpa_memory "90%" --arpa_prune 0 --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

python generate_package.py --alphabet …/alphabet.txt --lm lm.binary --vocab vocab-50000.txt --package kenlm.scorer --default_alpha 1.234 --default_beta 1.012

DeepSpeech 0.7.4

How much data is in your scorer_data.txt? A normal scorer would contain all of Wikipedia in your language plus any other textual data you can find that covers what the scorer should recognize.

It contains around 37,000 lines, which comes to roughly 185,000 words in total. Is that far too little to make a scorer?

It always depends on your use case, but Wikipedia easily gives you a couple million. So, for general language understanding this is not much. For special use cases this amount could be OK.
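As an illustration only (the file names and the repetition factor here are made up, not from this thread): one simple way to grow the scorer text while keeping the technical terms frequent is to concatenate a large general corpus with the domain sentences repeated a few times before running generate_lm.py.

# wiki_clean.txt stands for any large, cleaned general-language text,
# domain_sentences.txt for the sentences containing the technical terms.
cat wiki_clean.txt > scorer_data.txt
# Repeat the domain sentences so the technical words stay frequent enough
# to survive the --top_k cut-off.
for i in 1 2 3 4 5; do cat domain_sentences.txt >> scorer_data.txt; done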

Actually, the use case is to recognize technical or domain-specific words.

So is it normal that the scorer file size is just 1.5 MB ?

And just to be clear, should building one scorer file from my current data and training from scratch with it give decent performance?

Ideally, the scorer has many different sentences containing the technical words that you want to recognize. It looks for the most probable combination of words in the scorer for the letters it finds in the audio.

If you want to identify just 2 words in different combinations the scorer can be 5B.

If you want to recognize 1000 technical terms in different sentences, you should have 500MB of raw text input.

These are not exact numbers, it is just to give you estimates.
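A quick sanity check along these lines (the word "kubernetes" is just an example term, substitute your own; the vocab-*.txt file is the one written by generate_lm.py above):

# How often does the term occur in the text fed to generate_lm.py?
grep -o -i "kubernetes" ./combined_scratch/scorer_data.txt | wc -l
# Did it survive the --top_k cut into the vocabulary used for the package?
grep -i -w "kubernetes" ./combined_scratch/vocab-*.txt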

Thanks a lot, @othiele :grinning: It helped a lot to make the performance way better. I just made the custom scorer, tuned it with lm_optimizer.py, and the WER decreased by 10% instantly.
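For reference, a minimal sketch of the kind of lm_optimizer.py call meant here (the CSV and checkpoint paths are placeholders; the script searches for the alpha/beta pair that minimizes WER on the test set, and you then rebuild the package with generate_package.py using the best values it reports):

# Placeholder paths - point these at your own test set, checkpoints and scorer.
python ./DeepSpeech-0.7.4/lm_optimizer.py \
  --test_files ./combined_scratch/test.csv \
  --checkpoint_dir ./combined_scratch/checkpoints/ \
  --scorer_path ./combined_scratch/kenlm.scorer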
