TUTORIAL: How I trained a specific French model to control my robot

(Vincent Foucault) #1

Tutorial: how to build your homemade DeepSpeech model from scratch

Adapt the paths and params to your needs…

For my robotics project, I needed to create a small single-speaker model with nearly 1000 command sentences (full sentences, not just single words!)

I recorded the WAV files with a ReSpeaker Microphone Array.

They were recorded with the following params: mono / 16 bits / 16 kHz.
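
A quick way to check that every recording really has these params is to read the WAV header. This is only a sketch using Python's standard `wave` module; the function name `wav_format_ok` is mine, not part of DeepSpeech.

```python
import wave


def wav_format_ok(path, channels=1, sample_width=2, rate=16000):
    """Return True if the WAV file is mono, 16-bit (2 bytes), 16 kHz by default."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width
            and wav.getframerate() == rate
        )
```

Running it over the whole recording directory before training makes it easy to spot a file that was saved in stereo or at another sample rate.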

The Google VAD library helped me trim the silence before and after each WAV. DeepSpeech seems to cope with a WAV that still contains unnecessary silence, but, as Kdavis told me, removing the silence before processing reduces the time spent creating the model!
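
To illustrate the idea of trimming, here is a minimal sketch that uses a plain amplitude threshold instead of the VAD library (so it is a much cruder stand-in, not what I actually ran). The name `trim_silence` and the default `threshold` and `margin` values are assumptions to adapt to your microphone.

```python
def trim_silence(samples, threshold=500, margin=1600):
    """Drop leading and trailing samples that stay below `threshold`.

    samples: list of signed 16-bit integers (mono audio).
    margin:  number of samples kept as padding on each side
             (1600 samples = 0.1 s at 16 kHz).
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return samples[:0]  # the whole clip is below the threshold
    start = max(loud[0] - margin, 0)
    end = min(loud[-1] + 1 + margin, len(samples))
    return samples[start:end]
```

A real VAD looks at speech-like features rather than raw amplitude, so it behaves better with background noise; the threshold version only works for clean recordings.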


  • you should have a directory with all your WAV files (the more, the better!)

  • and a text file containing the complete transcript of each WAV, one per line (UTF-8 encoded)

We’ll call this text file the original text file.

1 - Cleaning the original text file:

  • open the alphabet.txt text file,
  • fill in your own alphabet,
  • save it,
  • open the original text file,
  • with your favourite editor, clean the file: every character in it MUST be present in alphabet.txt.
    Remove all punctuation, but you can keep the apostrophe as a character if it is present in the alphabet.
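
The cleaning can also be scripted. The sketch below assumes the alphabet is passed in as a set of characters; `invalid_chars` and `clean_transcript` are hypothetical helper names, not DeepSpeech functions.

```python
def invalid_chars(line, alphabet):
    """Return the characters of `line` that are missing from `alphabet`."""
    return set(line) - set(alphabet)


def clean_transcript(line, alphabet):
    """Lower-case the line, replace every character that is not in the
    alphabet by a space, then collapse runs of whitespace."""
    lowered = line.lower()
    kept = "".join(ch if ch in alphabet else " " for ch in lowered)
    return " ".join(kept.split())
```

For example, with an alphabet made of the lower-case letters, space, apostrophe and hyphen, `clean_transcript("Qui es-tu, et qui est-il ?", alphabet)` gives `qui es-tu et qui est-il`. It is worth calling `invalid_chars` first: an accented letter missing from the alphabet would otherwise be silently replaced by a space.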

2 - Create 3 directories: train, dev, test.

3 - Fill each directory with its WAV files and a new transcript file, in CSV format,
containing the transcripts of those specific WAV files…

Note about the textfiles :

  • you should have train.csv in the train dir, dev.csv in dev dir and test.csv in test dir

  • Each CSV file must start with the header line wav_filename,wav_filesize,transcript

  • And an example of my test.csv content:

    /home/nvidia/DeepSpeech/data/alfred/test/record.1.wav,85484,qui es-tu et qui est-il
    /home/nvidia/DeepSpeech/data/alfred/test/record.2.wav,97004,quel est ton nom ou comment tu t'appelles...
  • It seems that the WAV files should be split with the following ratio: 70 - 20 - 10!

70% of all the WAV files in the train dir, with the corresponding train.csv file,

20% in the dev dir, with the corresponding dev.csv file,

10% in the test dir, with the corresponding test.csv file.

IMPORTANT: a WAV file may appear in only one of the directories.
This is needed to build a good model (otherwise, it could result in overfitting…)
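
The split and the CSV files can be generated with a few lines of Python. This is a sketch under assumptions: `split_samples` and `write_csv` are names I made up, the seed is arbitrary, and the header columns are the ones DeepSpeech's own importers write as far as I know.

```python
import csv
import os
import random


def split_samples(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle the samples and cut them into disjoint train/dev/test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_dev = int(len(items) * ratios[1])
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # whatever is left goes to test
    return train, dev, test


def write_csv(csv_path, rows):
    """Write (wav_path, transcript) pairs; the size column is read from disk."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in rows:
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])
```

Because the three lists are slices of one shuffled list, no WAV can end up in two sets, which is exactly the constraint above.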


Here, we use the original text file, containing 100% of the WAV transcripts, and rename it vocabulary.txt.

We’ll use the powerful KenLM tools to build our LM: http://kheafield.com/code/kenlm/estimation/


1 - Creating the ARPA file for the binary build:

/bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa -o 3

I asked Kenneth Heafield about the -o param (“order of the model to estimate”).

It seems that for a small corpus (my case), a value of 3 or 4 works best.

See the lmplz params at the link above, if needed.

2 - creating binary file :

/bin/bin/./build_binary -T -s words.arpa  lm.binary
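
If you rebuild the LM often, the two KenLM steps can be chained from Python. This is only a sketch: `kenlm_commands` and `build_lm` are hypothetical names, `kenlm_bin` is wherever your KenLM binaries live, and I left out the extra build_binary flags used above, so add them back if you need them.

```python
import os
import subprocess


def kenlm_commands(kenlm_bin, text, arpa, binary, order=3):
    """Return the argument lists for lmplz (text -> ARPA) and
    build_binary (ARPA -> binary LM)."""
    lmplz = [os.path.join(kenlm_bin, "lmplz"),
             "--text", text, "--arpa", arpa, "-o", str(order)]
    build = [os.path.join(kenlm_bin, "build_binary"), arpa, binary]
    return lmplz, build


def build_lm(kenlm_bin, text, arpa, binary, order=3):
    """Run both steps and stop at the first one that fails."""
    for cmd in kenlm_commands(kenlm_bin, text, arpa, binary, order):
        subprocess.run(cmd, check=True)
```

Something like `build_lm("/bin/bin", "vocabulary.txt", "words.arpa", "lm.binary")` would then reproduce the two commands of this section.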


We’ll use the native_client “generate_trie” binary to create our trie file.

- Creating the trie file:

Adapt the paths! The last argument is the path where the trie file will be written.

/home/nvidia/tensorflow/bazel-bin/native_client/generate_trie \
/home/nvidia/DeepSpeech/data/alphabet.txt \
/home/nvidia/DeepSpeech/data/lm.binary \
/home/nvidia/DeepSpeech/data/vocabulary.txt \
<path to the output trie file>


Verify your directories:

  • TRAIN

    train.csv
    record.1.wav
    record.2.wav… (remember: all the WAV files are different)

  • DEV

    dev.csv
    its own WAV files

  • TEST

    test.csv
    its own WAV files

  • vocabulary.txt

  • lm.binary

  • trie
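
A small script can do this check before a long training run is launched. It is a sketch that assumes the three directories are named train, dev and test inside one data directory, as in my paths; `missing_items` is a made-up helper name.

```python
import os


def missing_items(data_dir):
    """Return the expected files that are absent from `data_dir`."""
    expected = [
        os.path.join("train", "train.csv"),
        os.path.join("dev", "dev.csv"),
        os.path.join("test", "test.csv"),
        "vocabulary.txt",
        "lm.binary",
        "trie",
    ]
    return [p for p in expected
            if not os.path.exists(os.path.join(data_dir, p))]
```

An empty list means everything listed above is in place.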

2 - Write your run file :


#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi

python -u DeepSpeech.py \
  --train_files /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
  --dev_files /home/nvidia/DeepSpeech/data/alfred/dev/dev.csv \
  --test_files /home/nvidia/DeepSpeech/data/alfred/test/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 33 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /home/nvidia/DeepSpeech/data/alfred/results/model_export/ \
  --checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/results/checkout/ \
  --decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
  --lm_binary_path /home/nvidia/DeepSpeech/data/alfred/lm.binary \
  --lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie

Adapt the paths and params to fit your needs…

Now, run the file FROM YOUR DEEPSPEECH directory:

You can leave the computer and watch an entire 12-episode series… before the process ends.

If everything worked correctly, you should now have /model_export/output_graph.pb: your model.

My model (I didn’t really respect the percentages, LOL):

  • model size (pb): 17.9 MB
  • train dir: 5700 WAV files (3-5 s each)
  • dev dir: 1500
  • test dir: 170
  • training was stopped by the early_stop param at a WER of 0.06.
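
For readers who wonder what a WER of 0.06 means: the word error rate is the word-level edit distance between the expected sentence and the inferred one, divided by the number of expected words. Here is a minimal sketch for one sentence pair; DeepSpeech's own report aggregates over the whole test set, and its exact averaging may differ from this.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = distance between the words seen so far in ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # word missing in hyp
                cur[j - 1] + 1,                        # extra word in hyp
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

So a WER of 0.06 corresponds to roughly 6 wrong words for every 100 expected words.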

Sure, I’ll do better with more material (WAV files and transcripts).

Enjoy running inferences with your model.


Now, I have a very good model: awesome!

But, sure, I could do better:
“Alfred”, my robot, gives bad inferences when I talk to it (him?) in 2 cases:

  • from a distance of 2-3 meters,
  • when it patrols in my house.

In the first case, the distance produces echoes and changes the amplitude of the voice signal.
In the second case, the wheel motors produce noise, and so does the floor texture…
In both cases, noise, echoes and amplitude variations cause bad inferences!

What can I do?

1/ Well, I’ll use specificity:
Alfred, you need to listen to and process a sentence with echo, because of the distance? OK!

I’ll make you learn modified sentences (with echoes in the WAV files):

  • record the same sentence at different distances from the microphone, and in different locations in the room/house

Noise? Happy voice? Sad one? --> I’ll make you learn a bit of each!

  • change the tone and speed of your voice, and record at different moments of the day (the noise should differ)

Sure, recording new WAV files to cover every scenario takes a bit of time, but I must say that the difference in inference is very impressive: I can talk to my robot while it is moving and ask it to stop, for example.

2/ Another way:
A Mozilla tool to modify WAV files by adding echo, pitch changes, noise…

python /media/nvidia/neo_backup/voice-corpus-tool-master/voice.py

A tool to apply a series of commands to a collection of samples.
Usage: voice.py (command <arg1> <arg2> … [-opt1 [<value>]] [-opt2 [<value>]] …)*


help
Display help message

add <source>
Adds samples to current buffer
source: string - Name of a named buffer or filename of a CSV file or WAV file (wildcards supported)

Buffer operations:

shuffle
Randomize order of the sample buffer

order
Order samples in buffer by length

reverse
Reverse order of samples in buffer

take <number>
Take given number of samples from the beginning of the buffer as new buffer
number: int - Number of samples

repeat <number>
Repeat samples of current buffer <number> times as new buffer
number: int - How often samples of the buffer should get repeated

skip <number>
Skip given number of samples from the beginning of current buffer
number: int - Number of samples

find <keyword>
Drop all samples whose transcription does not contain a keyword
keyword: string - Keyword to look for in transcriptions

clear
Clears sample buffer

Named buffers:

set <name>
Replaces named buffer with contents of buffer
name: string - Name of the named buffer

stash <name>
Moves buffer to named buffer (buffer will be empty afterwards)
name: string - Name of the named buffer

push <name>
Appends buffer to named buffer
name: string - Name of the named buffer

drop <name>
Drops named buffer
name: string - Name of the named buffer


print
Prints list of samples in current buffer

play
Play samples of current buffer

write <dir_name>
Write samples of current buffer to disk
dir_name: string - Path to the new sample directory. The directory and a file with the same name plus extension “.csv” should not exist.


reverb [-room_scale <room_scale>] [-hf_damping <hf_damping>] [-wet_gain <wet_gain>] [-stereo_depth <stereo_depth>] [-reverberance <reverberance>] [-wet_only] [-pre_delay <pre_delay>]
Adds reverberation to buffer samples
-room_scale: float - Room scale factor (between 0.0 to 1.0)
-hf_damping: float - HF damping factor (between 0.0 to 1.0)
-wet_gain: float - Wet gain in dB
-stereo_depth: float - Stereo depth factor (between 0.0 to 1.0)
-reverberance: float - Reverberance factor (between 0.0 to 1.0)
-wet_only: bool - If to strip source signal on output
-pre_delay: int - Pre delay in ms

echo <gain_in> <gain_out> <delay_decay>
Adds an echo effect to buffer samples
gain_in: float - Gain in
gain_out: float - Gain out
delay_decay: string - Comma separated delay decay pairs - at least one (e.g. 10,0.1,20,0.2)

speed <factor>
Adds a speed effect to buffer samples
factor: float - Speed factor to apply

pitch <cents>
Adds a pitch effect to buffer samples
cents: int - Cents (100th of a semitone) of shift to apply

tempo <factor>
Adds a tempo effect to buffer samples
factor: float - Tempo factor to apply

sox <effect> <args>
Adds a SoX effect to buffer samples
effect: string - SoX effect name
args: string - Comma separated list of SoX effect parameters (no white space allowed)

augment <source> [-gain <gain>] [-times <times>]
Augment samples of current buffer with noise
source: string - CSV file with samples to augment onto current sample buffer
-gain: float - How much gain (in dB) to apply to augmentation audio before overlaying onto buffer samples
-times: int - How often to apply the augmentation source to the sample buffer
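
To give an idea of what an echo effect does to the samples, here is a simplified single-echo model in plain Python. It is not the tool's (or SoX's) exact implementation, only a sketch; `add_echo` is a made-up name, `delay` is counted in samples, and the output keeps the input length instead of adding an echo tail.

```python
def add_echo(samples, delay, decay, gain_in=1.0, gain_out=1.0):
    """Mix a delayed, attenuated copy of the signal into itself.

    samples: list of signed 16-bit integers (mono audio).
    delay:   echo delay in samples (1600 = 0.1 s at 16 kHz).
    decay:   attenuation of the echo, between 0.0 and 1.0.
    """
    out = []
    for n, sample in enumerate(samples):
        echoed = samples[n - delay] * decay if n >= delay else 0.0
        value = gain_out * (gain_in * sample + echoed)
        # clip to the 16-bit range to avoid wrap-around distortion
        out.append(max(-32768, min(32767, int(round(value)))))
    return out
```

For instance, `add_echo([1000, 0, 0, 0], delay=2, decay=0.5)` returns `[1000, 0, 500, 0]`: the first sample comes back two samples later at half amplitude.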

Oh, I can modify a whole CSV file with a lot of params…
When you pass a CSV file, the modifications are applied on the fly to every WAV listed in it!
I tested PITCH, SPEED and TEMPO with values between -0.05 and 0.05, with very good results (test WER divided by 2).
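
As an illustration of a speed effect, the sketch below resamples the audio by linear interpolation, which changes duration and pitch together. It is my own simplification, not the tool's code; `change_speed` is a made-up name, and I am assuming that a factor of 1.0 means "unchanged", so a ±0.05 variation would correspond to factors between 0.95 and 1.05.

```python
def change_speed(samples, factor):
    """Resample by linear interpolation.

    factor > 1.0 makes the clip shorter (faster, higher pitched),
    factor < 1.0 makes it longer (slower, lower pitched).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor                 # position in the input signal
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        value = samples[left] * (1 - frac) + samples[right] * frac
        out.append(int(round(value)))
    return out
```

A tempo effect, by contrast, changes the duration while keeping the pitch, which needs a more elaborate algorithm than this.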

…to be continued!

(tbozo) #2

Any chance that you’ll release the model you created?
Would it work (for usage, not creation) on a low-end computer such as a Raspberry Pi?
It would be nice to use it in French with openjarvis.com, for example.

(Vincent Foucault) #3

Hi, tbozo,
Well, my model wouldn’t help you, and here is why:

  • the model is limited to my own voice -> it wouldn’t recognize you at all!
  • the model is strictly limited to the questions my bot can handle.

Yes, it should work on an RPi 2/3 (but ask Gerard-majax about that).

But I plan to create a multi-speaker French model, with the help of VoxForge.
That one would suit you!

This model will run on one of my next robots, the QBO1 (CORPORA), based on Arduino/RPi 3.

Now, about openjarvis: my AI is based on RiveScript-python (very, very powerful! You should try it!)

Hope this helps.

(Lissyx) #4

It’s going to depend heavily on your expectations, but as I recall, @elpimous_robot’s model was small enough that it would run not too badly on an RPi 2/3. I would not expect real time, though, but it might be manageable in your case.

(tbozo) #5

If you need some other voices for your model, I might be able to help you.
Openjarvis is nicely packaged and easy to use (at least if you use Raspbian Jessie and not Stretch). I use snowboy for offline hotword detection and Bing otherwise. It works quite well in French (with some problems for my 7-year-old child).
As for the interactions, it seems that the syntax is comparable, at least for basic tasks. There is a plugin for RiveScript called Rivescript bot.
Unfortunately, I can’t afford a time-consuming project right now…
I’m a C++/Python guy at work, but at home I go for the easiest :sunglasses:

(tbozo) #6

@lissyx I was looking for real time; it seems that I might need to change my hardware :slight_smile:
I’ll stay with the Bing API for now…

(Vincent Foucault) #7

How many French voices do you have? I’m interested!
For now, I only have nearly 5 h of my own voice (nearly 5000 train samples…)

I’m working on VoxForge to recover all the French material, but it’s harder than I expected (it will take more time…)

With a standard STT, a child’s voice is hard to recognize because of its different frequency;

but deep learning gets past this restriction.
Send me a private message for a specific discussion in French, if you want!

@elpimous_robot @lissyx

Is it possible to create a new trie/language model (as explained above) using transcripts with more jargon/technical speak, then use those in conjunction with the pre-built Mozilla firefox deepspeech output_graph? I don’t have the resources to train a completely new model, but I can certainly generate the language models for my specific (technical) version of English speakers. I’m just not sure if using the pre-built output_graph with the new language model and trie will work.

(Matti Meikäläinen) #9

I cannot find program named “generate_trie”… In my DeepSpeech folder there is a subfolder named native_client, but there is only generate_trie.cpp. Should I first compile it somehow? Could you give more instructions on how I can call generate_trie?

(Vincent Foucault) #10

Hi Mark2,
To obtain the generate_trie binary,
I had to compile the native client!
Have a look at the native_client/readme.md file.

(Vincent Foucault) #11

Hi Dj-Hay.
Sure. Creating a new trie file / vocabulary could help you recognize new words/sentences.

Be sure to have a complete sentence per line in your vocabulary file, and not just one word!

(Yv) #12

If you don’t want to do all the setup for building deepspeech from source, I’d recommend downloading mozilla’s pre-built native_client and use generate_trie command from there - see https://github.com/mozilla/DeepSpeech/tree/master/native_client

Basically running the following command should do the trick.

python util/taskcluster.py --target /path/to/destination/folder

(Matti Meikäläinen) #13

I ran the command and called the pre-built generate_trie program. However, I got a “-bash: ./generate-trie: cannot execute binary file” error, although it has execute permission for all. Is it because it was compiled on Linux and I use macOS? Are there any workarounds, or should I compile the program from sources?

(Lissyx) #14

taskcluster.py downloads linux by default. You need to pass --arch osx as documented.


Great, thanks! I’m curious as to what the neural network is doing then. Is it generating a bunch of vowel/consonant sound primitives that are fed into the trie/lm.binary? Then that trie/lm.binary decides which of the most probable words that ordering of vowel/consonant sound makes?


Ah, nevermind. I think the original paper (https://arxiv.org/abs/1408.2873) does show that the DNN part is predicting characters from the alphabet. Thus the DNN will create a large chunk of letters/spaces from the given audio. Then, that will be fed into the language model (which is completely separate from the DNN) and will optimize which combination of spaces and letters make the best sentence/words. Please correct me if I’m wrong.
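
As a rough illustration of the "characters first, language model second" idea, here is a greedy CTC-style decoding sketch: it takes the most likely label per audio frame, collapses repeats and drops the blank label. It leaves the language model out entirely, so it is a simplification of what the real decoder does; `ctc_greedy_decode` is a made-up name, and using the index right after the last symbol as the blank is an assumption.

```python
def ctc_greedy_decode(frame_labels, alphabet, blank=None):
    """Collapse repeated labels, drop blanks, map labels to characters.

    frame_labels: best label index for each audio frame.
    alphabet:     list of symbols, where alphabet[i] is the symbol of label i.
    blank:        index of the CTC blank label (assumed: len(alphabet)).
    """
    if blank is None:
        blank = len(alphabet)
    chars = []
    previous = None
    for label in frame_labels:
        # a label counts only when it differs from the previous frame
        if label != previous and label != blank:
            chars.append(alphabet[label])
        previous = label
    return "".join(chars)
```

With the alphabet `[" ", "a", "b", "c"]`, the frames `[1, 1, 4, 1, 2, 2, 4, 0, 3]` decode to `aab c`: the blank (4) between the two runs of 1 is what allows a doubled letter. The language model then helps choose between character sequences that sound alike.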

(P Holetzky) #17

Thanks for your tutorial, I’m currently training a german model using an open source corpus, this is a big help!

I was wondering why you use your vocabulary.txt instead of your alphabet.txt in your --alphabet_config_path parameter for DeepSpeech.py?

(Vincent Foucault) #18

Hi. Happy to help you.

Thanks for the question: my fault!

Of course, you point to alphabet.txt!

vocabulary.txt is used for the LM/trie…

If you see other errors…

See you.

(Vincent Foucault) #19

In alphabet.txt, you only have symbols!
Each symbol is a label.
DeepSpeech learns each label from a lot of sounds.

The other parts (LM/trie) then work hard to evaluate a heard sentence and predict the resulting inference.
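
To make the "symbol = label" idea concrete, here is a tiny sketch of the mapping. It is not the code from DeepSpeech's util/text.py, only an illustration; `build_label_maps` and `text_to_labels` are made-up names, and I assume each label is simply the position of the symbol in the alphabet.

```python
def build_label_maps(symbols):
    """Give every alphabet symbol an integer label (its position)."""
    str_to_label = {symbol: i for i, symbol in enumerate(symbols)}
    label_to_str = {i: symbol for symbol, i in str_to_label.items()}
    return str_to_label, label_to_str


def text_to_labels(text, str_to_label):
    """Encode a transcript as labels.

    A character that is missing from the alphabet raises KeyError,
    which is why the transcripts must be cleaned first.
    """
    return [str_to_label[ch] for ch in text]
```

With the symbols `[" ", "a", "b"]`, the text `ab a` becomes `[1, 2, 0, 1]`, while a transcript containing `ç` would fail until `ç` is added to the alphabet or removed from the transcript.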


Thanks for your tutorial. We have hundreds of audio files for just one person/speaker and are considering making a specific model. Was considering breaking up each audio into single words, for training purposes. However, now I see by your comment that a complete sentence is preferred.

My thinking on using the single word approach was to significantly reduce the size of the model, as it is for one person/speaker. For example, a 19 second WAV that has 55 words has 33 unique words. Is there any advantage in using the same word by the same speaker for training the model ?

I guess my question is - how differently can one person speak one word ?
