Tune Mozilla DeepSpeech to recognize specific sentences

@dara1400 I don’t have my computer in front of me, so can’t check exactly what worked for me previously, and like @yv001 I have yet to try a custom LM with 0.5.0.

If you could post the commands you’ve used, I can try to have a look tomorrow evening (or maybe at the weekend, depending on my timing).

My general principle here is that I’m happy to help (within reason!) if you can show you’ve put in some effort exploring the forum and source code and done that kind of legwork already.


ok, thanks a lot

I ran KenLM in Google Colab.
I uploaded the code and my text file.

Without --discount_fallback I got this error:

/content/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 3 because we didn’t observe any 1-grams with adjusted count 2; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback

After adding --discount_fallback, the lm.binary and trie were generated, but when I use DeepSpeech it crashes.

NewLm.zip (3.0 KB)


Hi @dara1400 - sorry, I didn’t manage to get onto this yesterday evening, but I’ve now got it working. I attach the LM (zipped up) plus a quick video to demo it working.

And the good news is that it’s pretty effective with your list of words/phrases.

I’ll list out some details below (you may well know some of this from your experiments, but hopefully it may help others also trying to do this).

I hope this helps - if you still have issues, post the errors you see (in detail ideally), at which point they occur etc etc and we can try to figure it out from there :slightly_smiling_face:

Key background:

The aim is to take your input (i.e. the list of “one”, “two”, etc.) and produce the two output files, lm.binary and trie

input file:
vocabulary.txt – this is the file of phrases that you want your LM to process

output files:

  • words.arpa – used to produce the other outputs, not used directly by DeepSpeech
  • lm.binary
  • trie


Install and build KenLM

See details here, or try a pre-built binary for your distro if one exists

Create a working directory

mkdir working
mkdir working/training_material
mkdir working/language_models

Create the file for your phrases

vocabulary.txt - store it in training_material, with each sentence on its own line
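For example, you could set up the layout and file like this (the phrases below are just illustrations; use your own list):

```shell
# Create the working layout and a sample vocabulary.txt,
# one phrase per line (phrases here are only examples).
mkdir -p working/training_material working/language_models
cat > working/training_material/vocabulary.txt << 'EOF'
one
two
three
turn on the light
turn off the light
EOF
```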

From the base of the KenLM folder

build/bin/lmplz --text working/training_material/vocabulary.txt --arpa working/language_models/words.arpa --order 5 --discount_fallback --temp_prefix /tmp/

note: I had previously also played around with --order of 3, 4 and 5 along with --prune 0 0 0 1 (for order 5).
I don’t recall exactly why I’d used prune, but it didn’t seem needed here. However, like your earlier attempts, I did need --discount_fallback (seemingly because the list of phrases is small)
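Incidentally, the “try deduplicating the input” hint from the earlier error message can be handled before running lmplz with a one-liner; a small self-contained sketch (the file paths and sample phrases are illustrative):

```shell
# Example: deduplicate a phrase list while preserving order.
# awk prints only the first occurrence of each line.
printf 'one\ntwo\none\nthree\ntwo\n' > /tmp/vocab.txt
awk '!seen[$0]++' /tmp/vocab.txt > /tmp/vocab_dedup.txt
cat /tmp/vocab_dedup.txt
```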

build/bin/build_binary -T -s trie working/language_models/words.arpa working/language_models/lm.binary

Using generate_trie from native_client

See above point about native_client

/path_to_native_client/generate_trie /path_to_deepspeech/DeepSpeech/data/alphabet.txt working/language_models/lm.binary working/language_models/trie

Testing it out

You’ll need to have installed DeepSpeech for this part onwards.

deepspeech --model /path_to_models/deepspeech-0.5.0-models/output_graph.pbmm --alphabet /path_to_models/deepspeech-0.5.0-models/alphabet.txt --lm working/language_models/lm.binary --trie working/language_models/trie --audio /path_to_test_wavs/p225_27280.wav

This gives output like the following (note: my test wav file didn’t have many words from the custom LM, but it shows that the LM is clearly being used):

Loading model from file deepspeech-0.5.0-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.0-alpha.11-0-g1201739
2019-06-15 16:28:50.969519: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-15 16:28:51.046477: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-15 16:28:51.046944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.42GiB
2019-06-15 16:28:51.046955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-15 16:28:51.457048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-15 16:28:51.457068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-15 16:28:51.457072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-15 16:28:51.457141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10088 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-06-15 16:28:51.460453: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-06-15 16:28:51.460467: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-06-15 16:28:51.460473: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-06-15 16:28:51.460598: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.493s.
Loading language model from files /home/neil/main/Projects/kenlm/working/language_models/lm.binary /home/neil/main/Projects/kenlm/working/language_models/trie
Loaded language model in 0.00601s.
Warning: original sample rate (22050) is different than 16kHz. Resampling might produce erratic speech recognition.
Running inference.
2019-06-15 16:28:51.655464: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
the would the form one how one one the one one
Inference took 2.372s for 6.615s audio file.

Trying it out with the Mic VAD Streaming example

python mic_vad_streaming.py -d 7 -m ../../models/deepspeech-0.5.0-models/ -l /path_to_working/working/language_models/lm.binary -t /path_to_working/working/language_models/trie -w wavs/

(this is using the script here)
custom_lm.zip (2.6 KB)


that was wonderful
thanks a lot
you are wonderful


I have tested my lm and your trie, and it works fine.
But when I use my trie it crashes, so I think the problem is the generate_trie that I am using.
Currently I am using Google Colab and native_client.amd64.cpu.linux.tar.xz.

The Windows generate_trie is working well.
I am wondering which native build is right for Google Colab.

That’s interesting, and something of a surprise that the Windows generate_trie works. Glad you’ve been able to make progress!


Hey Neil, thank you so much for providing such detailed steps for customising the language model and adding specific sentences.
I used your steps and it worked wonders. Thanks a lot!

However, I am encountering one problem: now I only get output drawn from the vocabulary of the specific sentences I added, and not the full set of general words that the original lm.binary had. Can you please suggest something to solve this?

I created my lm and trie,
but DeepSpeech confuses some of my commands.
Maybe I should mention that I am not a native speaker, so I don’t have an English accent like you.

I attached my vocab, lm and trie

NewLm2.zip (6.1 KB)

I am also not a native speaker.
My point was that my vocab, trie and lm work perfectly fine for all the sentences in my vocab. But when I run this:

deepspeech --model /path_to_models/deepspeech-0.5.0-models/output_graph.pbmm --alphabet /path_to_models/deepspeech-0.5.0-models/alphabet.txt --lm working/language_models/lm.binary --trie working/language_models/trie --audio /path_to_test_wavs/p225_27280.wav

for sentences or words which are not in the vocab, the output contains only words from the vocab, so the accuracy is not good enough. Do you understand what I am trying to say?


so you can detect what the voice command is.
You don’t need to recognise every word.

Exactly. So is there a way I can combine my created lm.binary with the lm.binary provided with the DeepSpeech pre-trained model? And likewise both tries?

There’s an open issue on the idea of using two language models here: https://github.com/mozilla/DeepSpeech/issues/1678

Until there’s movement on that, if you want both your commands and more general vocabulary, I think the best approach would be adding additional general sentences to your vocabulary file. The number added may need some tuning, since presumably the more general sentences you add, the greater the chance that the commands get mistaken.
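As a sketch of that approach (the file names, demo contents and sample size are all illustrative; in practice the sample size needs tuning, and shuf is from GNU coreutils):

```shell
# Combine command phrases with a random sample of general
# sentences into one vocabulary file for lmplz.
printf 'turn on the light\nturn off the light\n' > /tmp/commands.txt
seq -f 'general sentence %g' 10 > /tmp/general.txt
# Take a random subset of the general sentences (5 here; a real
# corpus would use a much larger sample, e.g. thousands).
shuf -n 5 /tmp/general.txt > /tmp/general_sample.txt
cat /tmp/commands.txt /tmp/general_sample.txt > /tmp/vocabulary.txt
```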


Hey, Neil.
My language model consists of atypical words such as company names, product names, etc.
So my accuracy suffers because of it (not very much, but some words just give erratic text output).
Can you help me with fine-tuning the acoustic model? I don’t think I need to train it from scratch, as my list of keywords / vocab.txt (created along the lines of your vocab.txt file) with specific sentences is not a large file.
Is there any specific method to fine-tune the acoustic model that could increase the accuracy for these specific sentences?
Thanks in advance, and for being so helpful.

Hi @singhal.harsh2 what have you tried with the acoustic model already?

Do you have audio recordings, in particular with the atypical words you mention? I don’t know for sure what to advise (so take this with a pinch of salt), but I’d suspect that just fine-tuning with that kind of audio data in the normal manner (i.e. as per the README) would help.

Best of luck!

@nmstoker thank you for the guide, very helpful. I am seeing one strange behaviour though: at inference I get results that are not in my vocabulary.txt at all. I have created the trie, words.arpa and lm.binary, and I obviously use those. Does this make sense? I use the pre-trained model 0.5.0.

Hi @safas - that does seem odd.

If you’ve closely followed the instructions, then it should only return words from your vocabulary file (as in the video, where, when I talk to the viewer at the end, it interprets what I say as the closest-fitting words from the vocab only, which are completely different from the actual words I said)

I expect you’ve checked carefully already but is it possible that at some stage you’ve either pulled in a different vocabulary file or somehow pointed the script at a different LM, words.arpa file or trie?


@nmstoker, yes I had messed up. The issue was a mismatch in trie generation that gave an error, but I went ahead anyway, and I guess it then falls back to the language model in the repo. I have tested this now with a vocab of 20K sentences and it works quite well. There is room for improvement though; I will update on that.

Glad you got to the bottom of it!


@nmstoker the issue with util/taskcluster.py is fixed on master; however, for others reading this thread and checking out v0.5.1, it will still fetch the wrong native_client.tar.gz, so I suggest you edit your post with:

you’ll need to have downloaded the relevant native client tar file for your environment ( for me that was native_client.amd64.cuda.linux.tar.xz ) and use generate_trie from there OR build it (this will be more complex and I didn’t go this route for speed)

Use util/taskcluster.py --branch <v0.5.0> …