Tune Mozilla DeepSpeech to recognize specific sentences

Hi @nmstoker, I am not able to interpret the loss, e.g.:

Epoch 1 | Training | Elapsed Time: 1 day, 23:54:37 | Steps: 2528 | Loss: 100.6082

The starting loss was around 178. How is the loss calculated?

Another question: I am training a model but did not pass the --epochs parameter, so is there a default value for epochs?

Thank you

I followed these instructions, but I am noticing some strange behavior (it sort of makes sense, but I want to mitigate it). I am trying to use DeepSpeech with a very small number of commands. I created a corpus that uses these command phrases, which include sequences of numbers, and then created the custom trie and lm.binary files.

The LM works and increases the accuracy of the model for my use case. The strange behavior is that the model becomes very bad at ignoring OOV words. Instead of classifying them as unknown, it seems to force everything into an in-vocabulary bucket.

For example, I created an LM that focuses mostly on numbers, and then as a smoke test I passed it audio from LibriSpeech, which may include numbers sometimes but is mostly other words.

The output of that is:

two nine one two 
one one ten one
five ten four seven
three one

Is there a way to check confidences so I can manually ignore these results, or can I set this up differently so that the language model itself better ignores them? Thanks!

I am not sure if I am fully right here, but you can configure a fixed word for OOV cases and handle it when OOV words are recognized.

You can check the KenLM documentation for the same. After you set this, your words.arpa file will include it with an "unk" tag.

Hope it helps. Once I get access to a laptop I can Google, read more, and give you a better answer.
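For illustration, a minimal sketch of what I mean (the KenLM path and vocabulary.txt are placeholders; --discount_fallback is usually needed for tiny corpora):

```bash
# Build a small ARPA LM with KenLM's lmplz. lmplz adds an <unk> entry
# to the 1-grams automatically - that is the "unk" tag mentioned above.
kenlm/build/bin/lmplz -o 3 --text vocabulary.txt --arpa words.arpa --discount_fallback

# Confirm the unknown-word token appears in the unigram section.
grep "<unk>" words.arpa
```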

I have a similar but slightly different issue. I have a list of 100 short phrases (2-5 words). I want to restrict recognition to only one of these phrases, and nothing else, not even single words from these phrases.

I understand this might be tricky in the microphone-streaming case, as the phrase boundary is more fluid there, so I am trying it only by passing the full WAV file of the phrase.

I tried deleting the 1-grams manually from the LM, but I get a run-time error that all words need to be in the 1-grams. Is there a better way to restrict all responses strictly to one of the phrases, thereby improving the accuracy?

Regards,

PS: Please keep up the great work you are doing on this project.

Hello @nmstoker, and thank you for sharing this…

I have to create a French DeepSpeech model… Do you have any ideas to share with me?

I would like to start by following these steps, using this French database which contains MP3 audio files only (I guess I can convert them to WAV afterwards)…
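(A minimal conversion sketch, assuming ffmpeg is installed; DeepSpeech expects 16 kHz, mono, 16-bit PCM WAV input, and the filenames here are placeholders:)

```bash
# Convert every MP3 in the current directory to 16 kHz mono 16-bit PCM WAV.
for f in *.mp3; do
  ffmpeg -i "$f" -ar 16000 -ac 1 -acodec pcm_s16le "${f%.mp3}.wav"
done
```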

Could you help me, please?

Thanks :wink:

@kamil_BENTOUNES welcome to the forum! I hope you are well.

Are there any specific points you want help with?

The documentation in the repo itself is probably the best place to start - I appreciate that those French instructions may be easier for you to follow, but it looks like it's a relatively old post, so you may run into problems where they differ from what's in the latest repo (or version 0.6, which is the last one with a released model, although that's an English model).

I'd also suggest looking over the forum for others working on French language models (I'm pretty sure there are some people involved in that already).

Hi @nmstoker, I'm fine thank you, and I hope you are too.

Thank you for your quick response. Actually, I found your idea very interesting, in the sense that it's exactly what I want to do (I tested word-recognition models but I am not too convinced by the results, even though they are good enough)… I've found this French model, which I am downloading (I'm on a low-speed connection LOL)… If I find that it works quite well, I would like to personalize it with French words as you did. That brings up a question which may seem odd to you: how much audio data did you use for the re-training on the desired words? I looked at your description and you never mention that… Does it work differently?

Thanks a lot, Neil!

I'm working so I have to be brief, but the process I described above operates purely on the language model; it doesn't require changes to the acoustic model (i.e. the part that operates on audio data).

It works because the acoustic model is already able to recognise basic sounds, and the LM is then used to restrict the words it guesses to only the ones in the shortlist.
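As a rough sketch of that flow for v0.5.x (commands.txt stands in for your phrase list, paths are placeholders, and I may be misremembering the exact generate_trie signature, so check the v0.5 docs):

```bash
# Build an LM over only the allowed phrases...
kenlm/build/bin/lmplz -o 4 --text commands.txt --arpa words.arpa --discount_fallback

# ...then the v0.5.x decoder artifacts: a quantized binary LM and the trie.
kenlm/build/bin/build_binary -a 255 -q 8 trie words.arpa lm.binary
./generate_trie alphabet.txt lm.binary trie   # generate_trie ships in native_client
```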

If you don't get great results, then you would want to look at using audio (but you would need a large amount); that's called fine-tuning (see the repo docs for comments on how that's done).

Do take care to ensure you install the same version/checkpoint of DeepSpeech as the model was trained on and to refer to the matching docs too (both are a common source of misunderstanding and problems!)
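For instance, assuming the 0.7.1 release:

```bash
pip3 install deepspeech==0.7.1   # match the version the model was trained on
deepspeech --version             # confirm what is actually installed
```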

Ah, I understand perfectly now! Thank you very much, and sorry for disturbing you… Take care of yourself during this health crisis, and thank you again!

No problem at all.

I hope you're doing okay at this time too. Best of luck with the model. It would be interesting to hear how you get on with it in due course :slightly_smiling_face:

Sorry, I didn't see your answer. I've configured/generated the language model files so that the model can recognize a list of 9 French words. It worked very well (except with the word "banane"). I don't know if that's due to the model I used, which is not very accurate, or whether the word "banane" gets treated as noise because its phonemes are unvoiced sounds.

Hi, can you please tell me where to download the native_client tar file? I want to run inference on CPU with v0.5.0. I am looking for generate_trie. Please provide the link.

As far as I remember, old native clients are no longer available due to some technical changes. You will have to build it yourself.

Thank you for the reply. I will build it myself.

Hi, thanks for the wonderful tutorial. I trained a language model for only 12 words, but when I use it with the pretrained model, the output is empty for most of the words. I want to predict only those 12 words. How can I improve the accuracy of the language model? If I don't use the language model, the output is gibberish.

Hi @Ajay_Ganesan - this might be hard to diagnose.
I would start by confirming that it isn't a recording-quality or accent issue: try to find the words in your list of 12 in another source, ideally one where they're said by people with a US accent (as the majority of the acoustic model training has been done with US-accented speakers). That would at least give a sense of whether it is equally challenged by those samples or not.

Can I also check whether you've stuck with an older version of DeepSpeech - your comments above ask about 0.5.0, and I'm guessing perhaps you stuck with that to be able to follow the steps in the tutorial. As there have been some significant improvements, I'd generally suggest trying to use 0.7.1 - I realise you will have to make some changes to the steps as the handling of the LM has changed a bit, but the principles are pretty similar (if anything it's easier now, and it's well documented).
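For reference, in 0.7.x the separate lm.binary and trie files are replaced by a single .scorer package; a sketch, with paths and the alpha/beta values as placeholders (depending on your exact 0.7 point release, the packaging tool is either the generate_scorer_package binary from native_client or the generate_package.py script in data/lm):

```bash
# Bundle the LM and vocabulary into a scorer package for DeepSpeech 0.7.x.
./generate_scorer_package --alphabet alphabet.txt --lm lm.binary \
  --vocab vocab-500000.txt --package kenlm.scorer \
  --default_alpha 0.93 --default_beta 1.18
```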

Switching won't necessarily help if it's an accent thing, as I believe the model is still stronger with US accents - it does pretty well with my UK accent, but there are areas where it seems to struggle with my accent too. If you were in that situation, then the way forward would be to look at fine-tuning the model, but you'd need a decent amount of transcribed audio, and I'd try to narrow down what the issue is before going down that route.

Hope that helps? Good luck!

Hi @nmstoker, this is very helpful, but I am getting an error while generating lm.binary and the other output files. Please help me generate lm.binary:

usage: ipykernel_launcher.py [-h] --vocabulary.txt VOCABULARY.TXT --output_dir OUTPUT_DIR --top_k TOP_K --kenlm_bins KENLM_BINS --arpa_order ARPA_ORDER --max_arpa_memory MAX_ARPA_MEMORY --arpa_prune ARPA_PRUNE --binary_a_bits BINARY_A_BITS --binary_q_bits BINARY_Q_BITS --binary_type BINARY_TYPE [--discount_fallback]
ipykernel_launcher.py: error: the following arguments are required: --vocabulary.txt, --output_dir, --top_k, --kenlm_bins, --arpa_order, --max_arpa_memory, --arpa_prune, --binary_a_bits, --binary_q_bits, --binary_type

An exception has occurred, use %tb to see the full traceback.
SystemExit: 2

@Yugandhar_Gantala your post doesn't seem to have enough information to investigate further. Can you give a bit more detail on what you're actually doing - versions, environment, etc.? It looks like you've called some code but haven't passed the required parameters.

Imagine I can't see what you're doing (because I cannot :slightly_smiling_face:)

Please search the forum, and if you post, give us more to work on. There are several posts on building the scorer.
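For what it's worth, a sketch of a full invocation with every required argument supplied - the values are illustrative (a tiny vocabulary may want a lower --arpa_order and no pruning), and the flag names are copied from your error message. Note also that the log shows ipykernel_launcher.py, i.e. the script is being run inside a notebook, where argparse won't see your arguments - run it from a shell instead:

```bash
python3 generate_lm.py --vocabulary.txt vocabulary.txt --output_dir . \
  --top_k 500000 --kenlm_bins kenlm/build/bin/ \
  --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie \
  --discount_fallback
```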

Hey @nmstoker,
I have created my own vocabulary.txt file and I want to train the DeepSpeech 0.7.3 pretrained model on my own vocabulary. I followed the steps you mentioned above. Now I am trying to generate the output files (lm.binary, words.arpa, trie), and while generating them I get an error that the following arguments are required: --vocabulary.txt, --output_dir, --top_k, --kenlm_bins, --arpa_order, --max_arpa_memory, --arpa_prune, --binary_a_bits, --binary_q_bits, --binary_type.
What is generate_trie for? We will be getting an output file trie, right? Isn't that enough to train the model on vocabulary.txt?