Tutorial: how to build your own homemade DeepSpeech model from scratch
Adapt links and params to your needs…
For my robotics project, I needed to create a small single-speaker model, with nearly 1000 sentence commands (not just single words!)
I recorded the wavs with a ReSpeaker Microphone Array:
https://www.seeedstudio.com/ReSpeaker-Mic-Array-Far-field-w%2F-7-PDM-Microphones-p-2719.html
Wavs were recorded with the following params: mono / 16 bits / 16 kHz.
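Quick check (a small sketch, the filename is just an example): the standard Python wave module can confirm a recording really is mono / 16 bits / 16 kHz before you feed it to DeepSpeech.

import wave

w = wave.open("record.1.wav", "rb")
print("channels:", w.getnchannels())      # expect 1 (mono)
print("bits:", w.getsampwidth() * 8)      # expect 16
print("rate:", w.getframerate())          # expect 16000
w.close()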
Using the Google VAD lib helped me trim the silence before/after each wav. DeepSpeech can also process wavs with unnecessary silence, but, as kdavis told me, removing silence before processing reduces the time spent on model creation! A minimal trimming sketch is shown below.
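This sketch assumes the VAD lib is py-webrtcvad and the wavs are already mono / 16 bits / 16 kHz (file names are just examples); it keeps the audio from the first to the last frame that the VAD flags as speech.

# Minimal silence-trimming sketch using py-webrtcvad (assumed to be the VAD lib).
import wave
import webrtcvad

def trim_silence(in_path, out_path, aggressiveness=2, frame_ms=30):
    w = wave.open(in_path, "rb")
    rate = w.getframerate()
    audio = w.readframes(w.getnframes())
    w.close()

    vad = webrtcvad.Vad(aggressiveness)               # 0 (lenient) to 3 (aggressive)
    frame_bytes = int(rate * frame_ms / 1000) * 2     # 16-bit mono samples
    frames = [audio[i:i + frame_bytes]
              for i in range(0, len(audio) - frame_bytes + 1, frame_bytes)]
    speech = [vad.is_speech(f, rate) for f in frames]
    if not any(speech):
        return                                        # no speech detected, keep file as is

    first = speech.index(True)
    last = len(speech) - 1 - speech[::-1].index(True)

    out = wave.open(out_path, "wb")
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(rate)
    out.writeframes(b"".join(frames[first:last + 1]))
    out.close()

trim_silence("record.1.wav", "record.1.trimmed.wav")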
MATERIAL PREPARATION:
- you should have a directory with all your wavs (the more, the better!),
- and a text file containing each wav's complete transcript, one per line (UTF-8 encoded);
we'll call this text file the original text file.
1 - Original text file cleaning:
- open the alphabet.txt text file,
- fill in your own alphabet,
- save it,
- open the original text file,
- with your best editor, clean the file: every character it contains MUST be present in alphabet.txt,
- remove all punctuation, but you can keep the apostrophe as a character if it is present in your alphabet (a quick check is sketched just below).
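A quick way to spot offending characters (paths are examples; DeepSpeech's alphabet.txt holds one character per line, lines starting with '#' being comments):

# List every character of the cleaned transcript file that is missing from alphabet.txt.
with open("alphabet.txt", encoding="utf-8") as f:
    alphabet = {line.rstrip("\n") for line in f if not line.startswith("#")}

unknown = set()
with open("original.txt", encoding="utf-8") as f:      # your original transcript file
    for line in f:
        for ch in line.rstrip("\n"):
            if ch not in alphabet:
                unknown.add(ch)

print("characters missing from alphabet.txt:", sorted(unknown))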
2 - Create 3 directories: train, dev, test.
3 - Fill each directory with its wavs and a new transcript file, as a CSV file containing only those wavs' transcripts…
Note about the CSV files:
- you should have train.csv in the train dir, dev.csv in the dev dir and test.csv in the test dir,
- each CSV file must start with the header line:
wav_filename,wav_filesize,transcript
- and here is an example of my test.csv content:
/home/nvidia/DeepSpeech/data/alfred/test/record.1.wav,85484,qui es-tu et qui est-il
/home/nvidia/DeepSpeech/data/alfred/test/record.2.wav,97004,quel est ton nom ou comment tu t'appelles...
- It seems that we should split the wavs with the following ratio: 70 - 20 - 10:
70% of all wavs in the train dir, with the corresponding train.csv file,
20% in the dev dir, with the corresponding dev.csv file,
10% in the test dir, with the corresponding test.csv file.
IMPORTANT: a given wav file must appear in only one directory and CSV file.
This is needed for good model creation, otherwise it could result in overfitting…
(A small split helper is sketched just below.)
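This helper is only a sketch (all names and paths are hypothetical, adapt them): it assumes record.N.wav matches line N of the original text file, copies each wav into exactly one of the three directories, and writes the matching CSV file.

# Hypothetical split helper: 70/20/10 split, one CSV per directory.
import csv
import os
import random
import shutil

wav_dir = "/home/nvidia/DeepSpeech/data/alfred/all_wavs"   # where all the wavs currently live (example)
base_dir = "/home/nvidia/DeepSpeech/data/alfred"           # contains train/, dev/ and test/ (example)
transcript_file = os.path.join(base_dir, "original.txt")   # one transcript per line, line N for record.N.wav

with open(transcript_file, encoding="utf-8") as f:
    transcripts = [line.strip() for line in f if line.strip()]

samples = [("record.%d.wav" % i, text) for i, text in enumerate(transcripts, start=1)]
random.shuffle(samples)                                    # shuffle before splitting

n = len(samples)
splits = {"train": samples[: int(0.7 * n)],
          "dev":   samples[int(0.7 * n): int(0.9 * n)],
          "test":  samples[int(0.9 * n):]}

for name, subset in splits.items():
    out_dir = os.path.join(base_dir, name)
    with open(os.path.join(out_dir, name + ".csv"), "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_name, text in subset:
            dst = os.path.join(out_dir, wav_name)
            shutil.copy2(os.path.join(wav_dir, wav_name), dst)   # each wav ends up in exactly one dir
            writer.writerow([dst, os.path.getsize(dst), text])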
LANGUAGE MODEL CREATION:
Here, we use the original text file, containing 100% of the wav transcripts, renamed to vocabulary.txt.
We'll use the powerful KenLM tools for our LM build: http://kheafield.com/code/kenlm/estimation/
DON'T FORGET TO COMPILE KENLM, OTHERWISE YOU WILL NOT FIND THE BINARIES!
1 - Creating the ARPA file for the binary build:
/bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3
I asked Kenneth Heafield about the -o param ("order of the model to estimate"):
it seems that for a small corpus (my case), a value of 3 to 4 is the best way to success.
See the lmplz params on the link above, if needed.
2 - Creating the binary file:
/bin/bin/./build_binary -T -s words.arpa lm.binary
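If you also installed the kenlm Python module (optional), you can sanity-check the binary LM by scoring an in-domain sentence against an out-of-domain one; the sentences below are just examples:

import kenlm

lm = kenlm.Model("lm.binary")
print(lm.score("quel est ton nom"))                     # in-domain sentence: higher (less negative) log10 score
print(lm.score("phrase totalement hors vocabulaire"))   # out-of-domain sentence: should score lower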
TRIE CREATION:
We'll use the native_client "generate_trie" binary to create our trie file.
- Creating the trie file (adapt your links!):
/home/nvidia/tensorflow/bazel-bin/native_client/generate_trie \
/home/nvidia/DeepSpeech/data/alphabet.txt \
/home/nvidia/DeepSpeech/data/lm.binary \
/home/nvidia/DeepSpeech/data/vocabulary.txt \
/home/nvidia/DeepSpeech/data/trie
RUN MODEL CREATION:
1 - Verify your directories:
- TRAIN
  train.csv
  record.1.wav
  record.2.wav… (remember: all wavs are different)
- DEV
  dev.csv
  record.1.wav
  record.2.wav…
- TEST
  test.csv
  record.1.wav
  record.2.wav…
- vocabulary.txt
- lm.binary
- trie
2 - Write your run file:
#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
echo "Please make sure you run this from DeepSpeech's top level directory."
exit 1
fi;
python -u DeepSpeech.py \
--train_files /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
--dev_files /home/nvidia/DeepSpeech/data/alfred/dev/dev.csv \
--test_files /home/nvidia/DeepSpeech/data/alfred/test/test.csv \
--train_batch_size 80 \
--dev_batch_size 80 \
--test_batch_size 40 \
--n_hidden 375 \
--epoch 33 \
--validation_step 1 \
--early_stop True \
--earlystop_nsteps 6 \
--estop_mean_thresh 0.1 \
--estop_std_thresh 0.1 \
--dropout_rate 0.22 \
--learning_rate 0.00095 \
--report_count 100 \
--use_seq_length False \
--export_dir /home/nvidia/DeepSpeech/data/alfred/results/model_export/ \
--checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/results/checkout/ \
--decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
--alphabet_config_path /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
--lm_binary_path /home/nvidia/DeepSpeech/data/alfred/lm.binary \
--lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie \
"$@"
Adapt links and params to fit your needs…
Now, run the file FROM YOUR DEEPSPEECH top-level directory:
./bin/run-alfred.sh
You can leave the computer and watch an entire 12-episode series… before the process ends.
If everything worked correctly, you should now have a /model_export/output_graph.pb: your model.
My model (I didn't really respect the percentages, LOL):
- model size (.pb): 17.9 MB
- train dir: 5700 wavs (3-5 s per wav)
- dev dir: 1500 wavs
- test dir: 170 wavs
- training stopped by the early_stop param at a 0.06 WER.
Surely I'll do better with more material (wavs and transcripts).
Enjoy your model and run some inferences on it; a quick test sketch follows.
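If you built the native client with its Python bindings, an inference test could look like the sketch below. Careful: the Model and enableDecoderWithLM argument lists changed between DeepSpeech releases, so compare with the native_client/python/client.py shipped with your version; the numeric values are the project defaults of that era, and all paths are mine.

# Hedged sketch: signatures and default weights may differ in your DeepSpeech version,
# check native_client/python/client.py before using it.
import scipy.io.wavfile as wav
from deepspeech.model import Model

ds = Model('/home/nvidia/DeepSpeech/data/alfred/results/model_export/output_graph.pb',
           26, 9,                                   # n_features, n_context
           '/home/nvidia/DeepSpeech/data/alfred/alphabet.txt',
           500)                                     # beam width
ds.enableDecoderWithLM('/home/nvidia/DeepSpeech/data/alfred/alphabet.txt',
                       '/home/nvidia/DeepSpeech/data/alfred/lm.binary',
                       '/home/nvidia/DeepSpeech/data/alfred/trie',
                       1.75, 1.00, 1.00)            # LM weight and word count weights (era defaults)

fs, audio = wav.read('/home/nvidia/DeepSpeech/data/alfred/test/record.1.wav')
print(ds.stt(audio, fs))                            # prints the decoded transcript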
DATA AUGMENTATION:
Now I have a very good model: awesome!
But surely I could do better:
"Alfred", my robot, gets bad inferences when I talk to it (him?) in 2 cases:
- from a distance of 2-3 meters,
- when it patrols around my house.
In the first case, the distance produces echoes and changes the amplitude of the voice…
In the second case, the motor wheels produce noise, and so does the floor texture…
In both cases, noise, echoes and amplitude variations cause bad inferences!!
What could I do?
1/ Well, I'll record the specific conditions:
Alfred, you need to listen to and process an echoing sentence, due to distance? Ok!
I'll make you learn modified sentences (with echoes inside the wavs):
- record the same sentence at different distances from the microphone, and in different locations in the room/house.
Noise? Happy voice? Sad one? --> I'll make you learn a bit of each!!!
- change the tone of your voice, the speed, and the time of day (background noise differs).
Sure, it takes a bit of time to record new wavs covering all scenarios, but I must say the difference in inference is very impressive: I can talk to my robot while it is moving and ask it to stop, for example.
2/ The other way:
A Mozilla tool to modify wavs with echoes, pitch changes, added noise… the voice corpus tool:
python /media/nvidia/neo_backup/voice-corpus-tool-master/voice.py
A tool to apply a series of commands to a collection of samples.
Usage: voice.py (command <arg1> <arg2> ... [-opt1 [<value>]] [-opt2 [<value>]] ...)*

Commands:

help
  Display help message

add <source>
  Adds samples to current buffer
  Arguments:
    source: string - Name of a named buffer or filename of a CSV file or WAV file (wildcards supported)

Buffer operations:

shuffle
  Randomize order of the sample buffer

order
  Order samples in buffer by length

reverse
  Reverse order of samples in buffer

take <number>
  Take given number of samples from the beginning of the buffer as new buffer
  Arguments:
    number: int - Number of samples

repeat <number>
  Repeat samples of current buffer <number> times as new buffer
  Arguments:
    number: int - How often samples of the buffer should get repeated

skip <number>
  Skip given number of samples from the beginning of current buffer
  Arguments:
    number: int - Number of samples

find <keyword>
  Drop all samples whose transcription does not contain a keyword
  Arguments:
    keyword: string - Keyword to look for in transcriptions

clear
  Clears sample buffer

Named buffers:

set <name>
  Replaces named buffer with contents of buffer
  Arguments:
    name: string - Name of the named buffer

stash <name>
  Moves buffer to named buffer (buffer will be empty afterwards)
  Arguments:
    name: string - Name of the named buffer

push <name>
  Appends buffer to named buffer
  Arguments:
    name: string - Name of the named buffer

drop <name>
  Drops named buffer
  Arguments:
    name: string - Name of the named buffer

Output:

print
  Prints list of samples in current buffer

play
  Play samples of current buffer

write <dir_name>
  Write samples of current buffer to disk
  Arguments:
    dir_name: string - Path to the new sample directory. The directory and a file with the same name plus extension ".csv" should not exist.

Effects:

reverb [-room_scale <room_scale>] [-hf_damping <hf_damping>] [-wet_gain <wet_gain>] [-stereo_depth <stereo_depth>] [-reverberance <reverberance>] [-wet_only] [-pre_delay <pre_delay>]
  Adds reverberation to buffer samples
  Options:
    -room_scale: float - Room scale factor (between 0.0 to 1.0)
    -hf_damping: float - HF damping factor (between 0.0 to 1.0)
    -wet_gain: float - Wet gain in dB
    -stereo_depth: float - Stereo depth factor (between 0.0 to 1.0)
    -reverberance: float - Reverberance factor (between 0.0 to 1.0)
    -wet_only: bool - If to strip source signal on output
    -pre_delay: int - Pre delay in ms

echo <gain_in> <gain_out> <delay_decay>
  Adds an echo effect to buffer samples
  Arguments:
    gain_in: float - Gain in
    gain_out: float - Gain out
    delay_decay: string - Comma separated delay decay pairs - at least one (e.g. 10,0.1,20,0.2)

speed <factor>
  Adds a speed effect to buffer samples
  Arguments:
    factor: float - Speed factor to apply

pitch <cents>
  Adds a pitch effect to buffer samples
  Arguments:
    cents: int - Cents (100th of a semitone) of shift to apply

tempo <factor>
  Adds a tempo effect to buffer samples
  Arguments:
    factor: float - Tempo factor to apply

sox <effect> <args>
  Adds a SoX effect to buffer samples
  Arguments:
    effect: string - SoX effect name
    args: string - Comma separated list of SoX effect parameters (no white space allowed)

augment <source> [-gain <gain>] [-times <times>]
  Augment samples of current buffer with noise
  Arguments:
    source: string - CSV file with samples to augment onto current sample buffer
  Options:
    -gain: float - How much gain (in dB) to apply to augmentation audio before overlaying onto buffer samples
    -times: int - How often to apply the augmentation source to the sample buffer
Oh, I can modify a whole CSV file with a lot of params…
By passing a CSV file, you apply on-the-fly modifications to every wav listed in it!!!
I tested PITCH, SPEED and TEMPO with values between -0.05 and 0.05, with very good results (the test WER was halved). An example invocation is sketched below.
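For example, an invocation could look like this (not verified on my side, adapt the paths; per the help above, add takes a CSV file, tempo takes a factor, and the output directory given to write must not exist yet):

python /media/nvidia/neo_backup/voice-corpus-tool-master/voice.py add /home/nvidia/DeepSpeech/data/alfred/train/train.csv tempo 1.05 write /home/nvidia/DeepSpeech/data/alfred/train_tempo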
…to be continued!