WAVs were recorded with the following parameters: mono, 16-bit, 16 kHz.
Using the Google VAD library helped me limit the silence before/after each WAV, but DeepSpeech seems to process the whole WAV, unnecessary silence included. (And, as Kdavis told me, removing silence before processing reduces the time spent on model creation!)
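As a quick sanity check (my own addition, not part of the original workflow), here is a minimal Python sketch that reports whether a WAV already matches these parameters; the file name is just an example.

```python
import wave

def check_wav(path):
    """Report whether a WAV file is mono, 16-bit, 16 kHz."""
    with wave.open(path, "rb") as w:
        ok = (w.getnchannels() == 1 and
              w.getsampwidth() == 2 and          # 2 bytes per sample = 16 bits
              w.getframerate() == 16000)
        print("%s: %d ch, %d-bit, %d Hz -> %s" % (
            path, w.getnchannels(), 8 * w.getsampwidth(), w.getframerate(),
            "OK" if ok else "needs conversion (e.g. with sox)"))
        return ok

check_wav("record.1.wav")   # example file name
```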
MATERIAL PREPARATION:
You should have a directory containing all your WAVs (the more, the better!),
and a text file containing the complete transcript of each WAV, one per line (UTF-8 encoded).
We'll call this text file the original text file.
1 - Original text file cleaning:
Open the alphabet.txt text file,
fill it with your own alphabet,
and save it.
Then open the original text file,
and with your best editor, clean it: every character it contains MUST be present in alphabet.txt.
Remove all punctuation, but you may keep the apostrophe as a character if it is present in the alphabet.
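To make this cleaning step less error-prone, here is a small sketch of my own (the file names original.txt and alphabet.txt are assumptions, adjust them to yours) that lists every character of the original text file that is missing from alphabet.txt.

```python
# List characters from the original text file that are not covered by alphabet.txt.
# Assumes one alphabet character per line in alphabet.txt; the '#' filter skips
# comment lines, drop it if '#' is a real character of your alphabet.

with open("alphabet.txt", encoding="utf-8") as f:
    alphabet = {line.rstrip("\n") for line in f
                if line.rstrip("\n") and not line.startswith("#")}

with open("original.txt", encoding="utf-8") as f:
    text = f.read()

stray = sorted({c for c in text if c != "\n" and c not in alphabet})
if stray:
    print("Characters to clean, or to add to alphabet.txt:", stray)
else:
    print("All characters are covered by alphabet.txt")
```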
2 - Create 3 directories: train, dev, test.
3 - Fill each directory with its corresponding WAVs and a new transcript file, as a CSV file
containing the transcripts of those specific WAVs.
Note about the text files:
You should have train.csv in the train dir, dev.csv in the dev dir, and test.csv in the test dir.
Each CSV file must start with the header line: wav_filename,wav_filesize,transcript
Here is an example of my test.csv content:
/home/nvidia/DeepSpeech/data/alfred/test/record.1.wav,85484,qui es-tu et qui est-il
/home/nvidia/DeepSpeech/data/alfred/test/record.2.wav,97004,quel est ton nom ou comment tu t'appelles...
It seems that we should split all the WAVs with the following ratio: 70 / 20 / 10.
70% of all WAVs in the train dir, with the corresponding train.csv file,
20% in the dev dir, with the corresponding dev.csv file,
10% in the test dir, with the corresponding test.csv file.
IMPORTANT: a WAV file must appear in only one directory/CSV. This is needed for good model creation (otherwise it could result in overfitting...). A sketch that builds the three CSVs is shown below.
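Here is a rough sketch of that split, again my own addition: it assumes WAVs named record.1.wav, record.2.wav, ... where line N of the original text file is the transcript of record.N.wav, so adapt the pairing to your own naming scheme. It only writes the CSVs; copy or move the WAVs into the matching directories as well.

```python
# Split (wav, transcript) pairs 70/20/10 and write train/train.csv, dev/dev.csv,
# test/test.csv with the required header. The pairing of WAVs to transcript lines
# is an example only.
import csv
import os
import random

wav_dir = "wavs"                                   # directory with all the WAVs (assumption)
with open("original.txt", encoding="utf-8") as f:  # one transcript per line
    transcripts = [line.strip() for line in f if line.strip()]

# Example pairing: record.1.wav <-> line 1, record.2.wav <-> line 2, ...
pairs = [(os.path.join(wav_dir, "record.%d.wav" % (i + 1)), t)
         for i, t in enumerate(transcripts)]

random.shuffle(pairs)
n = len(pairs)
splits = {"train": pairs[:int(0.7 * n)],
          "dev":   pairs[int(0.7 * n):int(0.9 * n)],
          "test":  pairs[int(0.9 * n):]}

for name, rows in splits.items():
    os.makedirs(name, exist_ok=True)
    with open(os.path.join(name, name + ".csv"), "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav, transcript in rows:
            writer.writerow([os.path.abspath(wav), os.path.getsize(wav), transcript])
```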
LANGUAGE MODEL CREATION:
Here we use the original text file, containing 100% of the WAV transcripts, and rename it vocabulary.txt.
Now, run the script from YOUR DEEPSPEECH directory: bin/run-alfred.sh
You can leave the computer and watch an entire 12-episode series... before the process ends.
If everything worked correctly, you should now have a model_export/output_graph.pb: your model.
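For reference (not part of the original script, which handles this for you): the language model itself is typically built from vocabulary.txt with KenLM. A rough sketch, assuming the KenLM binaries lmplz and build_binary are installed and on your PATH:

```python
# Rough sketch: build lm.binary from vocabulary.txt with KenLM.
# Assumes `lmplz` and `build_binary` (KenLM) are on PATH; on a very small corpus
# lmplz may require the extra flag --discount_fallback.
import subprocess

with open("vocabulary.txt", "rb") as vocab, open("lm.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "--order", "5"], stdin=vocab, stdout=arpa, check=True)

subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)

# The trie is then produced by the generate_trie tool shipped with native_client;
# its argument order differs between DeepSpeech releases, so check its usage output.
```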
My model (I didn't really respect the percentages, LOL):
model size (.pb): 17.9 MB
train dir: 5700 WAVs (3-5 s per WAV)
dev dir: 1500 WAVs
test dir: 170 WAVs
Training was stopped by the early_stop parameter with a 0.06 WER.
Surely I'll do better with more material (WAVs and transcripts).
Enjoy running inference with your model.
DATA AUGMENTATION:
Now I have a very good model: awesome!
But surely I could do better:
“Alfred”, my robot, gets bad inferences when I talk to it (him?) in 2 cases:
from a distance of 2-3 meters,
when it patrols around my house.
In the first case, the distance produces echoes and changes the amplitude of the voice signal...
In the second case, the wheel motors produce noise, and so does the floor texture...
In both cases, noise, echoes and signal amplitude variations cause bad inferences!
What could I do?
1/ Well, I'll target my specific conditions:
Alfred, you need to listen to and process an echoing sentence, because of the distance? OK!
I'll make you learn modified sentences (with echoes inside the WAVs):
record the same sentence at different distances from the microphone, and in different locations in the room/house.
Noise? A happy voice? A sad one? → I'll make you learn a bit of each!
Change the tone of your voice, the speed, and the time of day (the noise will differ).
Sure, it takes a bit of time to record new WAVs covering all the scenarios, but I must say that the difference in inference is very impressive: I can talk to my robot while it is moving and ask it to stop, for example.
2/ The other way:
a Mozilla tool to modify WAVs with echo, pitch changes, added noise...
A tool to apply a series of commands to a collection of samples.
Usage: voice.py (command … [-opt1 []] [-opt2 []] …)*
Commands:
help
Display help message
add <source>
Adds samples to current buffer
Arguments:
source: string - Name of a named buffer or filename of a CSV file or WAV file (wildcards supported)
Buffer operations:
shuffle
Randomize order of the sample buffer
order
Order samples in buffer by length
reverse
Reverse order of samples in buffer
take <number>
Take given number of samples from the beginning of the buffer as new buffer
Arguments:
number: int - Number of samples
repeat <number>
Repeat samples of current buffer <number> times as new buffer
Arguments:
number: int - How often samples of the buffer should get repeated
skip <number>
Skip given number of samples from the beginning of current buffer
Arguments:
number: int - Number of samples
find <keyword>
Drop all samples whose transcription does not contain the keyword
Arguments:
keyword: string - Keyword to look for in transcriptions
clear
Clears sample buffer
Named buffers:
set <name>
Replaces named buffer with contents of buffer
Arguments:
name: string - Name of the named buffer
stash <name>
Moves buffer to named buffer (buffer will be empty afterwards)
Arguments:
name: string - Name of the named buffer
push <name>
Appends buffer to named buffer
Arguments:
name: string - Name of the named buffer
drop <name>
Drops named buffer
Arguments:
name: string - Name of the named buffer
Output:
print
Prints list of samples in current buffer
play
Play samples of current buffer
write <dir_name>
Write samples of current buffer to disk
Arguments:
dir_name: string - Path to the new sample directory. The directory and a file with the same name plus extension “.csv” should not exist.
Effects:
reverb [-room_scale <room_scale>] [-hf_damping <hf_damping>] [-wet_gain <wet_gain>] [-stereo_depth <stereo_depth>] [-reverberance <reverberance>] [-wet_only] [-pre_delay <pre_delay>]
Adds reverberation to buffer samples
Options:
-room_scale: float - Room scale factor (between 0.0 to 1.0)
-hf_damping: float - HF damping factor (between 0.0 to 1.0)
-wet_gain: float - Wet gain in dB
-stereo_depth: float - Stereo depth factor (between 0.0 to 1.0)
-reverberance: float - Reverberance factor (between 0.0 to 1.0)
-wet_only: bool - Whether to strip the source signal on output
-pre_delay: int - Pre delay in ms
echo <gain_in> <gain_out> <delay_decay>
Adds an echo effect to buffer samples
Arguments:
gain_in: float - Gain in
gain_out: float - Gain out
delay_decay: string - Comma separated delay decay pairs - at least one (e.g. 10,0.1,20,0.2)
speed <factor>
Adds a speed effect to buffer samples
Arguments:
factor: float - Speed factor to apply
pitch <cents>
Adds a pitch effect to buffer samples
Arguments:
cents: int - Cents (100th of a semitone) of shift to apply
tempo <factor>
Adds a tempo effect to buffer samples
Arguments:
factor: float - Tempo factor to apply
sox <effect> <args>
Adds a SoX effect to buffer samples
Arguments:
effect: string - SoX effect name
args: string - Comma separated list of SoX effect parameters (no white space allowed)
augment <source> [-gain <gain>] [-times <times>]
Augment samples of current buffer with noise
Arguments:
source: string - CSV file with samples to augment onto current sample buffer
Options:
-gain: float - How much gain (in dB) to apply to augmentation audio before overlaying onto buffer samples
-times: int - How often to apply the augmentation source to the sample buffer
Oh, I can modify a whole CSV file with a lot of parameters...
By calling it on a CSV file, you apply on-the-fly modifications to every WAV in that CSV!
I tested PITCH, SPEED and TEMPO, with values between (-0.05, 0.05), with very good results (test WER halved).
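For illustration, here is one possible invocation, built only from the commands documented above (add, pitch, tempo, write); the paths and values are placeholders, so check voice.py help for the exact semantics of each argument.

```python
# Illustration only: run voice.py on a CSV, apply small pitch/tempo changes and
# write the augmented copies to a new directory (which must not exist yet).
# Paths and values are placeholders.
import subprocess

cmd = [
    "python", "voice.py",
    "add", "train/train.csv",     # load the samples listed in a CSV
    "pitch", "40",                # pitch shift in cents (int), per the help above
    "tempo", "1.05",              # tempo factor (float), per the help above
    "write", "train_augmented",   # new sample directory
]
subprocess.run(cmd, check=True)
```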
Hi,
Any chance that you'll release the created model?
Would it work (for usage, not creation) on a low-end computer such as a Raspberry Pi?
It would be nice to use it in French with openjarvis.com, for example.
Hi, tbozo,
Well, my model wouldn't help you, and here is why:
the model is limited to my own voice -> it wouldn't recognize you at all!
the model is strictly limited to the set of questions my bot can encounter.
Yes, it should work on an RPi 2/3 (but ask Gerard-majax about that).
But I plan to create a multi-speaker French model, with the help of VoxForge.
That one would suit you!
This model will run on one of my next robots, the QBO1 (CORPORA), based on Arduino/RPi 3.
Now, about openjarvis: my AI is based on Rivescript-python (very, very powerful! You should try it!).
Hope this helped you.
lissyx:
It's going to depend highly on your expectations, but I seem to remember that @elpimous_robot's model was small enough that it would run not too badly on an RPi 2/3. I would not expect real time, though, but it might be manageable in your case.
If you need some other voices for your model, I might be able to help you.
Openjarvis is nicely packaged and easy to use (at least if you use Raspbian Jessie and not Stretch). I use Snowboy for offline hotword detection and Bing otherwise. It works quite well in French (some problems with my 7-year-old child).
As for the interactions, the syntax seems comparable, at least for basic tasks. There is a plugin for RiveScript called Rivescript bot.
Unfortunately I can't afford a time-consuming project right now...
I'm a C++/Python guy at work, but at home I go for the easiest option.
Is it possible to create a new trie/language model (as explained above) using transcripts with more jargon/technical speech, and then use those in conjunction with the pre-built Mozilla DeepSpeech output_graph? I don't have the resources to train a completely new model, but I can certainly generate the language models for my specific (technical) variety of English speakers. I'm just not sure whether using the pre-built output_graph with the new language model and trie will work.
I cannot find a program named “generate_trie”... In my DeepSpeech folder there is a subfolder named native_client, but it only contains generate_trie.cpp. Should I compile it somehow first? Could you give more instructions on how to call generate_trie?
I ran the command and called the pre-built generate_trie program. However, I got a “-bash: ./generate-trie: cannot execute binary file” error, although it has execute permission for all. Is it because it was compiled for Linux and I use macOS? Are there any workarounds, or should I compile the program from source?
lissyx:
taskcluster.py downloads the Linux binaries by default. You need to pass --arch osx, as documented.
Great, thanks! I'm curious what the neural network is doing, then. Is it generating a bunch of vowel/consonant sound primitives that are fed into the trie/lm.binary, which then decides which words that ordering of vowel/consonant sounds most probably forms?
Ah, never mind. I think the original paper (https://arxiv.org/abs/1408.2873) does show that the DNN part predicts characters from the alphabet. So the DNN produces a long sequence of letters/spaces from the given audio, which is then fed into the language model (completely separate from the DNN) to decide which combination of spaces and letters makes the best sentence/words. Please correct me if I'm wrong.
Thanks for your tutorial. We have hundreds of audio files for just one person/speaker and are considering making a specific model. I was considering breaking each audio file up into single words for training purposes. However, I now see from your comment that complete sentences are preferred.
My thinking with the single-word approach was to significantly reduce the size of the model, as it is for one person/speaker. For example, a 19-second WAV that has 55 words contains 33 unique words. Is there any advantage in using the same word from the same speaker several times when training the model?
I guess my question is: how differently can one person speak a single word?