Tutorial: how to build your own homemade DeepSpeech model from scratch
Adapt links and params to your needs…
For my robotics project, I needed to create a small single-speaker model, with nearly 1000 sentence commands (not just single words!)
I recorded the wavs with a ReSpeaker Microphone Array:
https://www.seeedstudio.com/ReSpeaker-Mic-Array-Far-field-w%2F-7-PDM-Microphones-p-2719.html
Wavs were recorded with the following params: mono / 16 bits / 16 kHz.
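Quick check (a small sketch, the filename is just an example): the standard Python wave module can confirm a recording really is mono / 16 bits / 16 kHz before you feed it to DeepSpeech.

import wave

w = wave.open("record.1.wav", "rb")
print("channels:", w.getnchannels())      # expect 1 (mono)
print("bits:", w.getsampwidth() * 8)      # expect 16
print("rate:", w.getframerate())          # expect 16000
w.close()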
Using the Google VAD lib helped me trim the silence before/after each wav. DeepSpeech can also process wavs with unnecessary silence, but, as kdavis told me, removing silence before processing reduces the time spent on model creation! A minimal trimming sketch is shown below.
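This sketch assumes the VAD lib is py-webrtcvad and the wavs are already mono / 16 bits / 16 kHz (file names are just examples); it keeps the audio from the first to the last frame that the VAD flags as speech.

# Minimal silence-trimming sketch using py-webrtcvad (assumed to be the VAD lib).
import wave
import webrtcvad

def trim_silence(in_path, out_path, aggressiveness=2, frame_ms=30):
    w = wave.open(in_path, "rb")
    rate = w.getframerate()
    audio = w.readframes(w.getnframes())
    w.close()

    vad = webrtcvad.Vad(aggressiveness)               # 0 (lenient) to 3 (aggressive)
    frame_bytes = int(rate * frame_ms / 1000) * 2     # 16-bit mono samples
    frames = [audio[i:i + frame_bytes]
              for i in range(0, len(audio) - frame_bytes + 1, frame_bytes)]
    speech = [vad.is_speech(f, rate) for f in frames]
    if not any(speech):
        return                                        # no speech detected, keep file as is

    first = speech.index(True)
    last = len(speech) - 1 - speech[::-1].index(True)

    out = wave.open(out_path, "wb")
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(rate)
    out.writeframes(b"".join(frames[first:last + 1]))
    out.close()

trim_silence("record.1.wav", "record.1.trimmed.wav")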
MATERIAL PREPARATION:
- you should have a directory with all your wavs (the more, the better!),
- and a text file containing each wav's complete transcript, one per line (UTF-8 encoded);
we'll call this text file the original text file.
1 - Original text file cleaning:
- open the alphabet.txt text file,
- fill in your own alphabet,
- save it,
- open the original text file,
- with your best editor, clean the file: every character it contains MUST be present in alphabet.txt,
- remove all punctuation, but you can keep the apostrophe as a character if it is present in your alphabet (a quick check is sketched just below).
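A quick way to spot offending characters (paths are examples; DeepSpeech's alphabet.txt holds one character per line, lines starting with '#' being comments):

# List every character of the cleaned transcript file that is missing from alphabet.txt.
with open("alphabet.txt", encoding="utf-8") as f:
    alphabet = {line.rstrip("\n") for line in f if not line.startswith("#")}

unknown = set()
with open("original.txt", encoding="utf-8") as f:      # your original transcript file
    for line in f:
        for ch in line.rstrip("\n"):
            if ch not in alphabet:
                unknown.add(ch)

print("characters missing from alphabet.txt:", sorted(unknown))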
2 - Create 3 directories: train, dev, test.
3 - Fill each directory with its wavs and a new transcript file, as a CSV file containing only those wavs' transcripts…
Note about the CSV files:
- you should have train.csv in the train dir, dev.csv in the dev dir and test.csv in the test dir,
- each CSV file must start with the header line:
wav_filename,wav_filesize,transcript
- and here is an example of my test.csv content:
/home/nvidia/DeepSpeech/data/alfred/test/record.1.wav,85484,qui es-tu et qui est-il
/home/nvidia/DeepSpeech/data/alfred/test/record.2.wav,97004,quel est ton nom ou comment tu t'appelles...
- It seems that we should split the wavs with the following ratio: 70 - 20 - 10:
70% of all wavs in the train dir, with the corresponding train.csv file,
20% in the dev dir, with the corresponding dev.csv file,
10% in the test dir, with the corresponding test.csv file.
IMPORTANT: a given wav file must appear in only one directory and CSV file.
This is needed for good model creation, otherwise it could result in overfitting…
(A small split helper is sketched just below.)
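This helper is only a sketch (all names and paths are hypothetical, adapt them): it assumes record.N.wav matches line N of the original text file, copies each wav into exactly one of the three directories, and writes the matching CSV file.

# Hypothetical split helper: 70/20/10 split, one CSV per directory.
import csv
import os
import random
import shutil

wav_dir = "/home/nvidia/DeepSpeech/data/alfred/all_wavs"   # where all the wavs currently live (example)
base_dir = "/home/nvidia/DeepSpeech/data/alfred"           # contains train/, dev/ and test/ (example)
transcript_file = os.path.join(base_dir, "original.txt")   # one transcript per line, line N for record.N.wav

with open(transcript_file, encoding="utf-8") as f:
    transcripts = [line.strip() for line in f if line.strip()]

samples = [("record.%d.wav" % i, text) for i, text in enumerate(transcripts, start=1)]
random.shuffle(samples)                                    # shuffle before splitting

n = len(samples)
splits = {"train": samples[: int(0.7 * n)],
          "dev":   samples[int(0.7 * n): int(0.9 * n)],
          "test":  samples[int(0.9 * n):]}

for name, subset in splits.items():
    out_dir = os.path.join(base_dir, name)
    with open(os.path.join(out_dir, name + ".csv"), "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_name, text in subset:
            dst = os.path.join(out_dir, wav_name)
            shutil.copy2(os.path.join(wav_dir, wav_name), dst)   # each wav ends up in exactly one dir
            writer.writerow([dst, os.path.getsize(dst), text])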
LANGUAGE MODEL CREATION:
Here, we use the original text file, containing 100% of the wav transcripts, renamed to vocabulary.txt.
We'll use the powerful KenLM tools for our LM build: http://kheafield.com/code/kenlm/estimation/
DON'T FORGET TO COMPILE KENLM, OTHERWISE YOU WILL NOT FIND THE BINARIES!
1 - Creating the ARPA file for the binary build:
/bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3
I asked Kenneth Heafield about the -o param ("order of the model to estimate"):
it seems that for a small corpus (my case), a value of 3 to 4 is the best way to success.
See the lmplz params on the link above, if needed.
2 - Creating the binary file:
/bin/bin/./build_binary -T -s words.arpa lm.binary
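If you also installed the kenlm Python module (optional), you can sanity-check the binary LM by scoring an in-domain sentence against an out-of-domain one; the sentences below are just examples:

import kenlm

lm = kenlm.Model("lm.binary")
print(lm.score("quel est ton nom"))                     # in-domain sentence: higher (less negative) log10 score
print(lm.score("phrase totalement hors vocabulaire"))   # out-of-domain sentence: should score lower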
TRIE CREATION:
We'll use the native_client "generate_trie" binary to create our trie file.
- Creating the trie file (adapt your links!):
/home/nvidia/tensorflow/bazel-bin/native_client/generate_trie \
/home/nvidia/DeepSpeech/data/alphabet.txt \
/home/nvidia/DeepSpeech/data/lm.binary \
/home/nvidia/DeepSpeech/data/vocabulary.txt \
/home/nvidia/DeepSpeech/data/trie
RUN MODEL CREATION:
1 - Verify your directories:
- TRAIN
  train.csv
  record.1.wav
  record.2.wav… (remember: all wavs are different)
- DEV
  dev.csv
  record.1.wav
  record.2.wav…
- TEST
  test.csv
  record.1.wav
  record.2.wav…
- vocabulary.txt
- lm.binary
- trie
2 - Write your run file:
#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
echo "Please make sure you run this from DeepSpeech's top level directory."
exit 1
fi;
python -u DeepSpeech.py \
--train_files /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
--dev_files /home/nvidia/DeepSpeech/data/alfred/dev/dev.csv \
--test_files /home/nvidia/DeepSpeech/data/alfred/test/test.csv \
--train_batch_size 80 \
--dev_batch_size 80 \
--test_batch_size 40 \
--n_hidden 375 \
--epoch 33 \
--validation_step 1 \
--early_stop True \
--earlystop_nsteps 6 \
--estop_mean_thresh 0.1 \
--estop_std_thresh 0.1 \
--dropout_rate 0.22 \
--learning_rate 0.00095 \
--report_count 100 \
--use_seq_length False \
--export_dir /home/nvidia/DeepSpeech/data/alfred/results/model_export/ \
--checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/results/checkout/ \
--decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
--alphabet_config_path /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
--lm_binary_path /home/nvidia/DeepSpeech/data/alfred/lm.binary \
--lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie \
"$@"
Adapt links and params to fit your needs…
Now, run the file FROM YOUR DEEPSPEECH top-level directory:
./bin/run-alfred.sh
You can leave the computer and watch an entire 12-episode series… before the process ends.
If everything worked correctly, you should now have a /model_export/output_graph.pb: your model.
My model (I didn't really respect the percentages, LOL):
- model size (.pb): 17.9 MB
- train dir: 5700 wavs (3-5 s per wav)
- dev dir: 1500 wavs
- test dir: 170 wavs
- training stopped by the early_stop param at a 0.06 WER.
Surely I'll do better with more material (wavs and transcripts).
Enjoy your model and run some inferences on it; a quick test sketch follows.
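If you built the native client with its Python bindings, an inference test could look like the sketch below. Careful: the Model and enableDecoderWithLM argument lists changed between DeepSpeech releases, so compare with the native_client/python/client.py shipped with your version; the numeric values are the project defaults of that era, and all paths are mine.

# Hedged sketch: signatures and default weights may differ in your DeepSpeech version,
# check native_client/python/client.py before using it.
import scipy.io.wavfile as wav
from deepspeech.model import Model

ds = Model('/home/nvidia/DeepSpeech/data/alfred/results/model_export/output_graph.pb',
           26, 9,                                   # n_features, n_context
           '/home/nvidia/DeepSpeech/data/alfred/alphabet.txt',
           500)                                     # beam width
ds.enableDecoderWithLM('/home/nvidia/DeepSpeech/data/alfred/alphabet.txt',
                       '/home/nvidia/DeepSpeech/data/alfred/lm.binary',
                       '/home/nvidia/DeepSpeech/data/alfred/trie',
                       1.75, 1.00, 1.00)            # LM weight and word count weights (era defaults)

fs, audio = wav.read('/home/nvidia/DeepSpeech/data/alfred/test/record.1.wav')
print(ds.stt(audio, fs))                            # prints the decoded transcript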
DATA AUGMENTATION:
Now I have a very good model: awesome!
But surely I could do better:
"Alfred", my robot, gets bad inferences when I talk to it (him?) in 2 cases:
- from a distance of 2-3 meters,
- when it patrols around my house.
In the first case, the distance produces echoes and changes the amplitude of the voice…
In the second case, the motor wheels produce noise, and so does the floor texture…
In both cases, noise, echoes and amplitude variations cause bad inferences!!
What could I do?
1/ Well, I'll record the specific conditions:
Alfred, you need to listen to and process an echoing sentence, due to distance? Ok!
I'll make you learn modified sentences (with echoes inside the wavs):
- record the same sentence at different distances from the microphone, and in different locations in the room/house.
Noise? Happy voice? Sad one? --> I'll make you learn a bit of each!!!
- change the tone of your voice, the speed, and the time of day (background noise differs).
Sure, it takes a bit of time to record new wavs covering all scenarios, but I must say the difference in inference is very impressive: I can talk to my robot while it is moving and ask it to stop, for example.
2/ The other way:
A Mozilla tool to modify wavs with echoes, pitch changes, added noise… the voice corpus tool:
python /media/nvidia/neo_backup/voice-corpus-tool-master/voice.py
A tool to apply a series of commands to a collection of samples.
Usage: voice.py (command <arg1> <arg2> ... [-opt1 [<value>]] [-opt2 [<value>]] ...)*

Commands:

help
  Display help message

add <source>
  Adds samples to current buffer
  Arguments:
    source: string - Name of a named buffer or filename of a CSV file or WAV file (wildcards supported)

Buffer operations:

shuffle
  Randomize order of the sample buffer

order
  Order samples in buffer by length

reverse
  Reverse order of samples in buffer

take <number>
  Take given number of samples from the beginning of the buffer as new buffer
  Arguments:
    number: int - Number of samples

repeat <number>
  Repeat samples of current buffer <number> times as new buffer
  Arguments:
    number: int - How often samples of the buffer should get repeated

skip <number>
  Skip given number of samples from the beginning of current buffer
  Arguments:
    number: int - Number of samples

find <keyword>
  Drop all samples whose transcription does not contain a keyword
  Arguments:
    keyword: string - Keyword to look for in transcriptions

clear
  Clears sample buffer

Named buffers:

set <name>
  Replaces named buffer with contents of buffer
  Arguments:
    name: string - Name of the named buffer

stash <name>
  Moves buffer to named buffer (buffer will be empty afterwards)
  Arguments:
    name: string - Name of the named buffer

push <name>
  Appends buffer to named buffer
  Arguments:
    name: string - Name of the named buffer

drop <name>
  Drops named buffer
  Arguments:
    name: string - Name of the named buffer

Output:

print
  Prints list of samples in current buffer

play
  Play samples of current buffer

write <dir_name>
  Write samples of current buffer to disk
  Arguments:
    dir_name: string - Path to the new sample directory. The directory and a file with the same name plus extension ".csv" should not exist.

Effects:

reverb [-room_scale <room_scale>] [-hf_damping <hf_damping>] [-wet_gain <wet_gain>] [-stereo_depth <stereo_depth>] [-reverberance <reverberance>] [-wet_only] [-pre_delay <pre_delay>]
  Adds reverberation to buffer samples
  Options:
    -room_scale: float - Room scale factor (between 0.0 to 1.0)
    -hf_damping: float - HF damping factor (between 0.0 to 1.0)
    -wet_gain: float - Wet gain in dB
    -stereo_depth: float - Stereo depth factor (between 0.0 to 1.0)
    -reverberance: float - Reverberance factor (between 0.0 to 1.0)
    -wet_only: bool - If to strip source signal on output
    -pre_delay: int - Pre delay in ms

echo <gain_in> <gain_out> <delay_decay>
  Adds an echo effect to buffer samples
  Arguments:
    gain_in: float - Gain in
    gain_out: float - Gain out
    delay_decay: string - Comma separated delay decay pairs - at least one (e.g. 10,0.1,20,0.2)

speed <factor>
  Adds a speed effect to buffer samples
  Arguments:
    factor: float - Speed factor to apply

pitch <cents>
  Adds a pitch effect to buffer samples
  Arguments:
    cents: int - Cents (100th of a semitone) of shift to apply

tempo <factor>
  Adds a tempo effect to buffer samples
  Arguments:
    factor: float - Tempo factor to apply

sox <effect> <args>
  Adds a SoX effect to buffer samples
  Arguments:
    effect: string - SoX effect name
    args: string - Comma separated list of SoX effect parameters (no white space allowed)

augment <source> [-gain <gain>] [-times <times>]
  Augment samples of current buffer with noise
  Arguments:
    source: string - CSV file with samples to augment onto current sample buffer
  Options:
    -gain: float - How much gain (in dB) to apply to augmentation audio before overlaying onto buffer samples
    -times: int - How often to apply the augmentation source to the sample buffer
Oh, I can modify a whole CSV file with a lot of params…
By passing a CSV file, you apply on-the-fly modifications to every wav listed in it!!!
I tested PITCH, SPEED and TEMPO with values between -0.05 and 0.05, with very good results (the test WER was halved). An example invocation is sketched below.
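For example, an invocation could look like this (not verified on my side, adapt the paths; per the help above, add takes a CSV file, tempo takes a factor, and the output directory given to write must not exist yet):

python /media/nvidia/neo_backup/voice-corpus-tool-master/voice.py add /home/nvidia/DeepSpeech/data/alfred/train/train.csv tempo 1.05 write /home/nvidia/DeepSpeech/data/alfred/train_tempo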
…to be continued!