Tune Mozilla DeepSpeech to recognize specific sentences

Exactly. So is there a way I can combine my created lm.binary with the lm.binary provided with the DeepSpeech pre-trained model? And both tries too?

There’s an open issue on the idea of using two language models here: https://github.com/mozilla/DeepSpeech/issues/1678

Until there’s movement on that, if you want your commands along with more general vocabulary, I think the best approach would be adding additional general sentences. The number added may need some tuning: presumably, the more general sentences you add, the greater the chance the commands get mistaken.
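
As a rough sketch of what I mean (file names here are just placeholders, and this assumes KenLM’s standard tools as per the usual LM-building steps):

```bash
# Merge your command sentences with some additional general sentences
cat commands.txt general_sentences.txt > combined.txt

# Rebuild the LM with KenLM; --discount_fallback may be needed for small corpora
lmplz --order 3 --text combined.txt --arpa words.arpa --discount_fallback
build_binary words.arpa lm.binary

# Regenerate the trie with the native client's generate_trie (v0.5.x arguments)
./generate_trie alphabet.txt lm.binary trie
```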

Hey, Neil.
My language model consists of atypical words: company names, product names, etc.
So my accuracy suffers because of it (not very much, but some words just give erratic text output).
Can you help me with fine-tuning the acoustic model? I don’t really need to train it from scratch, as my keyword list / vocab.txt (created along the lines of your vocab.txt file) with special sentences is not a large file.
Is there a specific method to fine-tune the acoustic model which can help increase the accuracy for the specific sentences?
Thanks in advance, and for being so helpful.

Hi @singhal.harsh2, what have you tried with the acoustic model already?

Do you have audio recordings, in particular with the atypical words you mention? I don’t know for sure what to advise (so take this with a pinch of salt), but I’d suspect that just fine-tuning with that kind of audio data in the normal manner (i.e. as per the README) would help.
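
If you do go that route, the invocation is roughly as below. This is only a sketch with placeholder paths, and flag names have changed between releases (older versions used --epoch rather than --epochs), so check the docs matching your DeepSpeech version:

```bash
# Continue training from a release checkpoint with your own data
python3 DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir path/to/deepspeech-checkpoint \
  --epochs 3 \
  --train_files my_train.csv \
  --dev_files my_dev.csv \
  --test_files my_test.csv \
  --learning_rate 0.0001
```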

Best of luck!

@nmstoker thank you for the guide, very helpful. I am seeing one strange behavior though: at inference I get results that are not in my vocabulary.txt at all. I have created the trie, words.arpa and lm.binary, and I use those, obviously. Does this make sense? I use the pre-trained model 0.5.0.

Hi @safas - that does seem odd.

If you’ve closely followed the instructions then it should only return words from your vocabulary file, as per the video: when I talk to the viewer at the end, you see it interpret what I say as the closest-fitting words from the vocab only (which are completely different from the actual words I said).

I expect you’ve checked carefully already, but is it possible that at some stage you’ve either pulled in a different vocabulary file or somehow pointed the script at a different LM, words.arpa file or trie?
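
One way to rule that out is to pass every file explicitly on the command line (v0.5.x flags; the file names here are just whatever you generated):

```bash
deepspeech --model output_graph.pbmm \
           --alphabet alphabet.txt \
           --lm lm.binary \
           --trie trie \
           --audio test.wav
```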

@nmstoker, yes, I had messed up. The issue was a mismatch in trie generation that gave an error, but I went ahead anyway, and I guess it then falls back to the language model in the repo. I have tested this now with a vocab of 20K sentences and it works quite well. There is room for improvement though; will update on that.

Glad you got to the bottom of it!

@nmstoker the issue with util/taskcluster.py is fixed on master; however, for others reading this thread and checking out v0.5.1, it will still fetch the wrong native_client.tar.gz. So I suggest you edit your post as follows.

Change:

you’ll need to have downloaded the relevant native client tar file for your environment (for me that was native_client.amd64.cuda.linux.tar.xz) and use generate_trie from there OR build it (this will be more complex and I didn’t go this route for speed)

Add:

Use util/taskcluster.py --branch <v0.5.0> …
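
That is, something along these lines (assuming a v0.5.1 checkout; the --branch value should match the tag you actually have checked out):

```bash
# Fetch the native client matching your checked-out release rather than master
python3 util/taskcluster.py --branch v0.5.1 --target native_client/
```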

Hi @nmstoker, I am not able to analyze the loss, e.g.:

Epoch 1 | Training | Elapsed Time: 1 day, 23:54:37 | Steps: 2528 | Loss: 100.6082

The starting loss was around 178. How is the loss calculated?

And another question: I am training a model but did not give the --epochs parameter, so is there a default value for epochs?

Thank you

I followed these instructions, but I am noticing some strange behavior (it sort of makes sense, but I want to mitigate it). I am trying to use deepspeech with a very small number of commands. I created a corpus that uses these command phrases which include sequences of numbers and then created the custom trie and lm.binary files.

The LM works and increases the accuracy of the model for my use case. The strange behavior is that the model becomes very bad at ignoring OOV words. Instead of classifying them as “<unk>”, it seems to be forcing things into a bucket.

For example, I created an LM that focuses mostly on numbers, and then, as a smoke test, I passed it audio from LibriSpeech, which may include numbers sometimes but is mostly other words.

The output of that is:

two nine one two 
one one ten one
five ten four seven
three one

Is there a way to check confidences so I can manually ignore these results, or can I set this up differently to better ignore them by means of the language model? Thanks!

I am not sure if I am fully right here, but you can configure a fixed word for OOV cases and handle it during recognition of OOV words.

You can check the KenLM documentation for the same. After you set this, your words.arpa file will have it with an “<unk>” tag.

Hope it helps. Once I get access to a laptop I can Google around, read more, and give you a better answer.
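
In the meantime, a quick sanity check (a sketch, assuming a KenLM-built ARPA file) is to confirm the entry is there:

```bash
# KenLM puts an <unk> entry in the 1-grams section of the ARPA file it produces
grep -m 1 '<unk>' words.arpa
```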

I have a similar but slightly different issue. I have a list of 100 short phrases (2-5 words). I want to restrict recognition to only one of these phrases, and nothing else, not even single words from these phrases.

I understand this might be tricky in the microphone streaming case, as the phrase boundary is more fluid there, so I am trying it only by passing the full WAV file of the phrase.

I tried deleting the 1-grams manually from the LM, but I get a run-time error that all words need to be in the 1-grams. Is there a better way to restrict all responses strictly to one of the phrases, thereby improving the accuracy?

Regards,

PS: Please keep up the great work you are doing on this project.

Hello @nmstoker, and thank you for sharing this…

I have to create a French DeepSpeech model… Do you have some ideas to share with me?

I would like to start by following these steps, using this French database, which contains MP3 audio files only (I guess I can convert them to WAV afterwards)…
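
For the conversion, something like this should work (a sketch, assuming the model expects 16 kHz mono 16-bit audio, as the released DeepSpeech models do):

```bash
# Convert MP3 to the 16 kHz mono 16-bit PCM WAV format DeepSpeech expects
ffmpeg -i input.mp3 -ar 16000 -ac 1 -acodec pcm_s16le output.wav
```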

Could you help me, please?

Thanks :wink:

@kamil_BENTOUNES welcome to the forum! I hope you are well.

Are there any specific points you want help with?

The documentation in the repo itself is probably the best place to start. I appreciate that those French instructions may be easier for you to follow, but it looks like a relatively old post, so you may run into problems where it differs from what’s in the latest repo (or version 0.6, which is the last one with a released model, although that’s an English model).

I’d also suggest looking over the forum for others working on French language models (I’m pretty sure there are some people involved in that already).

Hi @nmstoker, I’m fine, thank you, and I hope you are too.

Thank you for your quick response. Actually, I found your idea very interesting, in the sense that it’s exactly what I want to do (I tested word-recognition models but I am not too convinced by the results, even though they are good enough)… I’ve found this French model, which I am downloading (I’m on a low-speed connection LOL)… If I find that it works quite well, I would like to personalize it with French words as you did. That said, I have a question which may seem odd to you: how much audio data did you use for the re-training on the desired words? Because I looked at your description and you never mention that… Does it work differently?

Thank’s a lot Neil !

I’m working so I have to be brief, but the process I described above operates purely on the language model; it doesn’t require changes to the acoustic model (i.e. the part that operates on audio data).

It works because the acoustic model is already able to recognise basic sounds and then the LM is being used to restrict the words that it guesses so that they are only the ones in the shortlist.

If you don’t get great results then you would want to look at using audio (but you would need a large amount); that’s called fine-tuning (see the repo docs for comments on how that’s done).

Do take care to ensure you install the same version/checkpoint of DeepSpeech as the model was trained on and to refer to the matching docs too (both are a common source of misunderstanding and problems!)

Ah, I understand perfectly now! Thank you very much, and sorry for disturbing you… Take care of yourself during this health crisis, and thank you again!

No problem at all.

I hope you’re doing okay at this time too. Best of luck with the model. Would be interesting to hear how you get on with it in due course :slightly_smiling_face:

Sorry, I didn’t see your answer. I’ve configured/generated the language model files so that the model is able to recognize a list of 9 French words. It worked very well (except with the word “banane”). I don’t know if that’s due to the model I used, which is not very efficient, or whether the word “banane” is treated as noise because of its unvoiced phonemes.
