Tutorial : How to build your own DeepSpeech model from scratch
Adapt links and params with your needs…
For my robotics project, I needed to create a small mono-speaker model, with nearly 1000 sentence orders (not just single words !)
I recorded the wavs with a ReSpeaker Microphone Array :
The wavs were recorded with the following params : mono / 16 bits / 16 kHz.
Using the Google WebRTC VAD lib helped me limit the silence before/after each wav, since DeepSpeech seems to process the whole wav, unnecessary silence included. (But, as Kdavis told me, removing silence before processing reduces the time spent on model creation !)
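As an alternative to a VAD, the silence at both ends of a wav can also be stripped with sox's `silence` effect. A sketch (the thresholds and file names are illustrative, not from my setup):

```shell
# Trim leading silence, then reverse, trim the (former trailing) silence, reverse back.
# "1 0.1 1%" = trim until 0.1 s of audio above 1% amplitude is seen.
sox input.wav trimmed.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse
```

Tune the duration and amplitude threshold to your recordings so quiet speech is not cut off.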
MATERIAL PREPARATION :
you should have a directory with all your wavs (the more, the better !!)
and a text file containing each wav’s complete transcript, one per line (UTF-8 encoded).
We’ll call this text file the original text file.
1 - original textfile cleaning :
- open the alphabet.txt textfile,
- fill it with your own alphabet, one character per line,
- save it
- open the original textfile,
- with your best editor, clean the file : every character in it MUST be present in alphabet.txt,
- remove all punctuation ; you can keep the apostrophe as a character, if it is present in your alphabet.
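To verify the cleaning, you can list every character used in the transcript that is missing from alphabet.txt. A sketch with made-up demo contents (the file names follow the tutorial):

```shell
# Demo alphabet : one allowed character per line (here a, b, c and space).
printf 'a\nb\nc\n \n' > alphabet.txt
# Demo transcript : 'd' and '!' are deliberately NOT in the alphabet.
printf 'abc cab\nbad!\n' > original.txt
# Every distinct character used in the transcript...
grep -o . original.txt | sort -u > chars_used.txt
# ...every allowed character...
sort -u alphabet.txt > chars_allowed.txt
# ...and the offending ones (empty output = clean transcript).
comm -23 chars_used.txt chars_allowed.txt
```

Here it prints the two characters still to clean out (`!` and `d`).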
2 - create 3 directories : train, dev, test.
3 - feed each dir with its corresponding wavs and a new transcript file, as a CSV file,
containing those specific wavs’ transcripts…
Note about the textfiles :
you should have train.csv in the train dir, dev.csv in dev dir and test.csv in test dir
Each CSV file must start with the header line:
wav_filename,wav_filesize,transcript
And an example of my test.csv content:
/home/nvidia/DeepSpeech/data/alfred/test/record.1.wav,85484,qui es-tu et qui est-il
/home/nvidia/DeepSpeech/data/alfred/test/record.2.wav,97004,quel est ton nom ou comment tu t'appelles...
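The second CSV column is the wav file’s size in bytes, so a line can be generated per file. A sketch (the 5-byte stand-in “wav” and its transcript are just for the demo):

```shell
# Create a stand-in "wav" just for the demo (5 bytes).
printf 'RIFF!' > record.1.wav
# wav_filesize = size of the file in bytes.
size=$(wc -c < record.1.wav)
# Emit one CSV line : absolute path, size, transcript.
echo "$(pwd)/record.1.wav,$size,qui es-tu et qui est-il" >> test.csv
```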
- It seems that we should split the wavs with the following ratio : 70 - 20 - 10 !
70% of all wavs in the train dir, with the corresponding train.csv file,
20% in the dev dir, with the corresponding dev.csv file,
10% in the test dir, with the corresponding test.csv file.
IMPORTANT : a given wav file must appear in only one directory’s CSV file.
This is needed for good model creation (otherwise, it could result in overfitting…)
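The 70/20/10 split can be scripted. A minimal sketch (the directory names follow the tutorial; the ten demo files are made up):

```shell
# Demo dirs and ten fake wavs standing in for real recordings.
mkdir -p all train dev test
for i in $(seq 1 10); do : > "all/record.$i.wav"; done

total=$(ls all/*.wav | wc -l)
n_train=$(( total * 70 / 100 ))   # 70% -> train
n_dev=$((   total * 20 / 100 ))   # 20% -> dev, the rest -> test

i=0
for f in all/*.wav; do
  if   [ "$i" -lt "$n_train" ];             then mv "$f" train/
  elif [ "$i" -lt $(( n_train + n_dev )) ]; then mv "$f" dev/
  else                                           mv "$f" test/
  fi
  i=$(( i + 1 ))
done
```

For a real corpus, shuffle the file list first (e.g. with `shuf`) so each split covers all recording sessions.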
LANGUAGE MODEL CREATION :
Here, we use the original text file, containing 100% of the wav transcripts, and we rename it vocabulary.txt.
We’ll use the powerful KenLM tools for our LM build : http://kheafield.com/code/kenlm/estimation/
1 - Creating arpa file for binary build :
/bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa -o 3
I asked Kenneth Heafield about the -o param (“order of the model to estimate”) :
it seems that for a small corpus (my case), a value of 3 or 4 works best.
See lmplz params on web link, if needed.
2 - creating binary file :
/bin/bin/./build_binary -T -s words.arpa lm.binary
TRIE CREATION :
We’ll use the native_client “generate_trie” binary to create our trie file.
- creating the trie file :
Adapt your paths !
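The generate_trie arguments have changed between DeepSpeech versions ; in the version this tutorial targets it takes, as far as I can tell, the alphabet, the LM binary, the vocabulary and the output trie path. A sketch using the same directory layout as the rest of the tutorial (check your own native_client build’s usage message) :

```shell
# Argument order is version-dependent -- verify against your native_client.
/home/nvidia/tensorflow/bazel-bin/native_client/generate_trie \
    /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
    /home/nvidia/DeepSpeech/data/alfred/lm.binary \
    /home/nvidia/DeepSpeech/data/alfred/vocabulary.txt \
    /home/nvidia/DeepSpeech/data/alfred/trie
```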
RUN MODEL CREATION :
1 - Verify your directories :
2 - Write your run file :
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi

python -u DeepSpeech.py \
--train_files /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
--dev_files /home/nvidia/DeepSpeech/data/alfred/dev/dev.csv \
--test_files /home/nvidia/DeepSpeech/data/alfred/test/test.csv \
--train_batch_size 80 \
--dev_batch_size 80 \
--test_batch_size 40 \
--n_hidden 375 \
--epoch 33 \
--validation_step 1 \
--early_stop True \
--earlystop_nsteps 6 \
--estop_mean_thresh 0.1 \
--estop_std_thresh 0.1 \
--dropout_rate 0.22 \
--learning_rate 0.00095 \
--report_count 100 \
--use_seq_length False \
--export_dir /home/nvidia/DeepSpeech/data/alfred/results/model_export/ \
--checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/results/checkout/ \
--decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
--alphabet_config_path /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
--lm_binary_path /home/nvidia/DeepSpeech/data/alfred/lm.binary \
--lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie \
Adapt links and params to fit your needs…
Now, run the file IN YOUR DEEPSPEECH directory :
you can leave the computer and watch an entire 12-episode series… before the process ends !
If everything worked correctly, you should now have a
/model_export/output_graph.pb file : your model.
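To try the model, the native client of that era could be invoked roughly like this (a sketch : the argument list varies by DeepSpeech version, and the input audio file here is hypothetical) :

```shell
# Old native_client usage : deepspeech <model> <audio> <alphabet> [lm] [trie]
deepspeech /home/nvidia/DeepSpeech/data/alfred/results/model_export/output_graph.pb \
           my_order.wav \
           /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
           /home/nvidia/DeepSpeech/data/alfred/lm.binary \
           /home/nvidia/DeepSpeech/data/alfred/trie
```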
My Model : (I didn’t really respect the percentages, LOL)
- model size (.pb) : 17.9 MB
- train dir : 5700 wavs (3-5 s/wav)
- dev dir : 1500 wavs
- test dir : 170 wavs
- model stopped by the early_stop param with a WER of 0.06.
I’m sure I’ll do better with more material (wavs and transcripts).
Enjoy your model and its inferences ! Vincent FOUCAULT