Tune Mozilla DeepSpeech to recognize specific sentences

For standard English, the acoustic model can stay the same. Fine-tuning of the acoustic model would only be needed if you were planning on transcribing atypical words, e.g. names of products or companies.


Thanks a lot.

So I don't have to train the model, because I don't need any new words.

I could not find a solution to my problem of making the LM and trie.

This is my text file:

one
two
three
how
what
could
where
report
maximize
minimize
could you maximize the form
would you minimize the form

I know this is asking a lot, but would you test it? (making the LM and trie and running deepspeech)

@yv001

Sorry, I don't have time to set up 0.5.0 now.


@dara1400 I don’t have my computer in front of me, so can’t check exactly what worked for me previously, and like @yv001 I have yet to try a custom LM with 0.5.0.

If you could post the commands you’ve used, I can try to have a look tomorrow evening (or maybe the weekend, depending on my timing).

My general principle here is that I’m happy to help (within reason!) if you can make the case that you’ve put in some effort exploring the forum and source code, and done that kind of legwork already.


OK, thanks a lot.

I ran KenLM in Google Colab.
I uploaded the code and my text file.

Without --discount_fallback I got this error:

/content/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0’.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 3 because we didn’t observe any 1-grams with adjusted count 2; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback

After adding --discount_fallback, the lm.binary and trie were generated, but when I use deepspeech it crashes.

@nmstoker
NewLm.zip (3.0 KB)


Hi @dara1400 - sorry, I didn’t manage to get onto this yesterday evening, but I’ve managed to get it working; I attach the LM (zipped up), plus I did a quick video to demo it working.

And the good news is that it’s pretty effective with your list of words/phrases.

I’ll list out some details below (you may well know some of this from your experiments, but hopefully it may help others also trying to do this)

I hope this helps - if you still have issues, post the errors you see (in detail ideally), at which point they occur etc. and we can try to figure it out from there 🙂

Key background:

The aim is to take your input (i.e. the list of “one”, “two”, etc.) and produce the two output files lm.binary and trie

input file:
vocabulary.txt – this is the file of phrases that you want your LM to process

output files:

  • words.arpa – used to produce the other outputs, not used directly by DeepSpeech
  • lm.binary
  • trie

Steps

Install and build KenLM

See details here, or try a pre-built binary for your distro if one exists

Create a working directory

mkdir working
mkdir working/training_material
mkdir working/language_models

Create the file for your phrases

vocabulary.txt - store it in training_material, with each sentence on its own line
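For example, to create it with the phrases from earlier in this thread (a bash sketch; a heredoc is just one convenient way to do it):

cat > working/training_material/vocabulary.txt <<'EOF'
one
two
three
how
what
could
where
report
maximize
minimize
could you maximize the form
would you minimize the form
EOF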

From the base of the KenLM folder

build/bin/lmplz --text working/training_material/vocabulary.txt --arpa working/language_models/words.arpa --order 5 --discount_fallback --temp_prefix /tmp/

Note: I had previously also played around with --order of 3, 4 and 5, along with --prune 0 0 0 1 (for order 5).
I don’t recall exactly why I’d used pruning, but it didn’t seem needed here. However, like your earlier attempts, I did need --discount_fallback (seemingly because the list of phrases is small).
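For reference, the pruned order-5 variant mentioned above would look something like this (again, it didn't seem needed for a phrase list this small):

build/bin/lmplz --text working/training_material/vocabulary.txt --arpa working/language_models/words.arpa --order 5 --prune 0 0 0 1 --discount_fallback --temp_prefix /tmp/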

build/bin/build_binary -T -s trie working/language_models/words.arpa working/language_models/lm.binary

Using generate_trie from native_client

See above point about native_client

/path_to_native_client/generate_trie /path_to_deepspeech/DeepSpeech/data/alphabet.txt working/language_models/lm.binary working/language_models/trie

Testing it out

You’ll need to have installed deepspeech for this part onwards

deepspeech --model /path_to_models/deepspeech-0.5.0-models/output_graph.pbmm --alphabet /path_to_models/deepspeech-0.5.0-models/alphabet.txt --lm working/language_models/lm.binary --trie working/language_models/trie --audio /path_to_test_wavs/p225_27280.wav

Gives output like this (note, my test wav file didn’t have many words from the custom LM, but this shows how it clearly is using the LM):

Loading model from file deepspeech-0.5.0-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.0-alpha.11-0-g1201739
2019-06-15 16:28:50.969519: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-15 16:28:51.046477: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-15 16:28:51.046944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.42GiB
2019-06-15 16:28:51.046955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-15 16:28:51.457048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-15 16:28:51.457068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-06-15 16:28:51.457072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-06-15 16:28:51.457141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10088 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-06-15 16:28:51.460453: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-06-15 16:28:51.460467: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-06-15 16:28:51.460473: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-06-15 16:28:51.460598: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.493s.
Loading language model from files /home/neil/main/Projects/kenlm/working/language_models/lm.binary /home/neil/main/Projects/kenlm/working/language_models/trie
Loaded language model in 0.00601s.
Warning: original sample rate (22050) is different than 16kHz. Resampling might produce erratic speech recognition.
Running inference.
2019-06-15 16:28:51.655464: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
the would the form one how one one the one one
Inference took 2.372s for 6.615s audio file.
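If you'd rather call it from Python than via the CLI, this is roughly what the packaged client does (a sketch against the 0.5.0 Python API; the paths are placeholders, so adjust to your setup):

import wave
import numpy as np
from deepspeech import Model

# hyperparameters matching the 0.5.0 command-line client
N_FEATURES = 26
N_CONTEXT = 9
BEAM_WIDTH = 500
LM_ALPHA = 0.75
LM_BETA = 1.85

ds = Model('deepspeech-0.5.0-models/output_graph.pbmm', N_FEATURES, N_CONTEXT,
           'deepspeech-0.5.0-models/alphabet.txt', BEAM_WIDTH)

# point the decoder at the custom LM and trie built above
ds.enableDecoderWithLM('deepspeech-0.5.0-models/alphabet.txt',
                       'working/language_models/lm.binary',
                       'working/language_models/trie',
                       LM_ALPHA, LM_BETA)

# load a 16-bit mono wav (ideally 16 kHz, to avoid the resampling warning)
with wave.open('test.wav', 'rb') as w:
    fs = w.getframerate()
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

print(ds.stt(audio, fs))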

Trying it out with the Mic VAD Streaming example

python mic_vad_streaming.py -d 7 -m ../../models/deepspeech-0.5.0-models/ -l /path_to_working/working/language_models/lm.binary -t /path_to_working/working/language_models/trie -w wavs/

(this is using the script here)
custom_lm.zip (2.6 KB)


That was wonderful,
thanks a lot.
You are wonderful!


I have tested my LM with your trie, and it works fine,
but when I use my trie it crashes, so I think the problem is the generate_trie that I am using.
Currently I am using Google Colab and native_client.amd64.cpu.linux.tar.xz.

The Windows generate_trie is working well.
I am wondering which native_client build is right for Google Colab.

That’s interesting, and something of a surprise, that the Windows generate_trie works. Glad you’ve been able to make progress!
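By the way, for a Colab VM (Linux x86_64, CPU) you could also try fetching a matching native_client with the helper script in the DeepSpeech repo - something like the below (flags from memory, so do check them against your checkout, and match --branch to your model version):

python3 util/taskcluster.py --arch cpu --branch v0.5.0 --target native_client_dl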


Hey Neil, thank you so much for providing such detailed steps for customising the language model and adding specific sentences.
I used your steps and it worked wonders. Thanks a lot!

I am encountering one problem, though: now I only get output using the vocabulary of those specific sentences that I added, and not the full set of general words that the original lm.binary file had. Can you please suggest something to solve this…

I created my LM and trie,
but deepspeech confuses some of my commands.
Maybe I should mention that I am not a native speaker, so I don't have an English accent like you.

I attached my vocab, lm and trie

NewLm2.zip (6.1 KB)

I am also not a native speaker.
My point was that my vocab, trie and LM work perfectly fine for all the sentences that I have in my vocab. But when I run this:

deepspeech --model /path_to_models/deepspeech-0.5.0-models/output_graph.pbmm --alphabet /path_to_models/deepspeech-0.5.0-models/alphabet.txt --lm working/language_models/lm.binary --trie working/language_models/trie --audio /path_to_test_wavs/p225_27280.wav

for sentences or words which are not in the vocab, the output only contains words from the vocab, and hence the accuracy is not good enough. Do you understand what I am trying to say?


Yes,
so you can detect what the voice command is.
You don't need to recognize every word.

Exactly. So is there a way I can combine my lm.binary with the lm.binary provided with the DeepSpeech pretrained model? And both of the tries too?

There’s an open issue on the idea of using two language models here: https://github.com/mozilla/DeepSpeech/issues/1678

Until there’s movement on that, if you want your commands along with more general vocabulary, I think the best approach would be to add additional general sentences to your vocabulary file - the number added may need some tuning, as presumably the more general sentences you add, the greater the chance the commands get mistaken.
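As a very rough sketch of that (the file names here are hypothetical - commands.txt being your command phrases and general_corpus.txt any large collection of English sentences you've sourced), you'd mix a sample of general sentences in with your commands and rebuild:

shuf -n 5000 general_corpus.txt > general_sample.txt
cat commands.txt general_sample.txt > working/training_material/vocabulary.txt

Then re-run the lmplz, build_binary and generate_trie steps from earlier in the thread.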


Hey, Neil.
My language model consists of atypical words, like company names, names of products, etc.
So my accuracy suffers because of it (not very much, but some words just give erratic text output).
Can you help me with fine-tuning the acoustic model? I don't really need to train it from scratch, as my list of keywords / vocab.txt (created along the lines of your vocab.txt file, with special sentences) is not a large file.
Is there a specific method to fine-tune the acoustic model which could help increase the accuracy for the specific sentences?
Thanks in advance, and for being so helpful.

Hi @singhal.harsh2, what have you tried with the acoustic model already?

Do you have audio recordings, in particular ones with the atypical words you mention? I don’t know for sure what to advise (so take this with a pinch of salt), but I’d suspect that just fine-tuning with that kind of audio data in the normal manner (i.e. as per the README) would help.
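In outline, that would mean continuing training from the released checkpoint with CSVs you'd prepare from your own recordings - something along these lines (a sketch; the exact flag names, e.g. for the epoch count, vary a little between versions, so check DeepSpeech.py --help, and the CSV file names here are placeholders):

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir path/to/deepspeech-0.5.0-checkpoint --epochs 3 --train_files fine_tune_train.csv --dev_files fine_tune_dev.csv --test_files fine_tune_test.csv --learning_rate 0.0001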

Best of luck!

@nmstoker thank you for the guide, very helpful. I am seeing one strange behaviour though: at inference I get results that are not in my vocabulary.txt at all. I have created the trie, words.arpa and lm.binary, and I use those, obviously. Does this make sense? I use the pre-trained model 0.5.0.

Hi @safas - that does seem odd.

If you’ve closely followed the instructions, it should only return words from your vocabulary file - as per the video, where, when I talk to the viewer at the end, you see it interpret the things I say as the closest-fitting words from the vocab only (which are not at all the actual words I said).

I expect you’ve checked carefully already, but is it possible that at some stage you’ve either pulled in a different vocabulary file or somehow pointed the script at a different LM, words.arpa file or trie?
