TUTORIAL: How I trained a specific French model to control my robot

DJ-Hay · January 15, 2018, 4:37pm

Great, thanks! I’m curious what the neural network is doing, then. Is it generating a sequence of vowel/consonant sound primitives that are fed into the trie/lm.binary? And does the trie/lm.binary then decide which words that sequence of sounds most probably makes?

DJ-Hay · January 15, 2018, 4:48pm

Ah, never mind. I think the original paper (https://arxiv.org/abs/1408.2873) does show that the DNN part predicts characters from the alphabet. Thus the DNN produces a long sequence of letters/spaces from the given audio. That sequence is then fed into the language model (which is completely separate from the DNN), which works out which combination of spaces and letters makes the best words/sentence. Please correct me if I’m wrong.
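
As a rough illustration of that first stage (a toy sketch, not DeepSpeech’s actual code): assuming the network emits one character per audio frame plus a CTC-style "blank", as in the linked paper, collapsing repeats and dropping blanks gives the raw letter string that the language model then has to sort out. The BLANK symbol and the frame values below are made up for the example.

```python
# Toy sketch of collapsing per-frame character predictions (CTC-style).
# Assumption: "_" stands in for the blank label; real models use an index.
BLANK = "_"


def ctc_greedy_collapse(frame_labels):
    """Merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for ch in frame_labels:
        # Emit a character only when it differs from the previous frame
        # and is not the blank symbol.
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)


# One predicted character per audio frame (invented values).
frames = list("hh_e_ll_ll_oo")
print(ctc_greedy_collapse(frames))
```

The blank between the two runs of "l" is what lets a doubled letter survive the merge step.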

p.holetzky · January 17, 2018, 9:09pm

Thanks for your tutorial. I’m currently training a German model using an open source corpus, and this is a big help!

I was wondering why you use your vocabulary.txt instead of your alphabet.txt in your --alphabet_config_path parameter for DeepSpeech.py?

elpimous_robot · January 17, 2018, 9:42pm

Hi. Happy to help you.

Thanks for the question: my fault!

Of course, you should point to alphabet.txt!

vocabulary.txt is used for lm/trie…

If you see other errors…

See you.

elpimous_robot · January 17, 2018, 9:59pm

Yes.
alphabet.txt contains only symbols.
Each symbol is a label.
DeepSpeech learns each label from a lot of sounds.

The other parameters (lm/trie) then do the hard work of evaluating a heard sentence and predicting the inference result.

jehoshua · January 21, 2018, 10:57pm

Thanks for your tutorial. We have hundreds of audio files for just one person/speaker and are considering making a specific model. I was considering breaking up each audio file into single words for training purposes. However, I now see from your comment that complete sentences are preferred.

My thinking behind the single-word approach was to significantly reduce the size of the model, as it is for one person/speaker. For example, a 19 second WAV that has 55 words has 33 unique words. Is there any advantage in training the model on the same word spoken several times by the same speaker?

I guess my question is: how differently can one person speak one word?

elpimous_robot · January 22, 2018, 8:38am

Hi, Jehoshua.
Here is an easy answer.
Do a test:
Record the same word twice, with the same tone and duration.
Open both files in Audacity and zoom in.
Your eyes will detect variations.
And that’s only considering your voice…
Our environment is really noisy.

Keep in mind that your computer is a bit silly: for it, any variation means a different sound.

The more sounds per character, the easier it is for the silly PC to recognize it…

Also, logical sentences are essential for the trie build, to help DeepSpeech produce a good inference.
Hope this helps.

Oh, I forgot part of your question: record different sentences.
I’ll update the tutorial this afternoon.

mark2 · January 22, 2018, 12:53pm

Publicly available audio corpora usually come in files much longer than 3-5 seconds. If I am training my own model, will DeepSpeech learn from files that are, say, 10-15 minutes long?

Of course I can split those big files into shorter ones using a voice activity detection tool, but such tools are not perfect… so I might end up with sentences cut in the middle, and in any case it requires much more manual work, i.e. adjusting audio files and transcripts, etc…

elpimous_robot · January 22, 2018, 2:08pm

Hi Mark2.

I had the same idea, but…

Because silence durations vary, I think errors can easily happen.

So, imagine that a cut lands in the middle of the sound for a character…
This would build a very bad model…

I think the best way is specificity: inferences are done with WAVs of about 5 s, so training files of about 5-10 s max seems to be the correct way (not too much risk of gaps in learning).
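
A minimal sketch for checking clip lengths against that 5-10 s guideline, using only Python’s standard wave module. The function names and the 10-second default are chosen just for illustration, and it assumes plain PCM WAV files.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def split_by_duration(paths, max_seconds=10.0):
    """Separate clips that fit the limit from those that need more cutting."""
    keep, too_long = [], []
    for path in paths:
        if wav_duration_seconds(path) <= max_seconds:
            keep.append(path)
        else:
            too_long.append(path)
    return keep, too_long
```

The clips returned in the second list are the ones to cut again (and re-transcribe) before training.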

The best way to cut silences is VAD plus human checking… good luck.

Have a look at Kaldi; they work on word separation…

Or ask the DeepSpeech team about the process they used to create this perfect model.

jehoshua · January 23, 2018, 1:00am

Thanks. Yes, in a test with Audacity the differences were quite recognisable. I will have to look into how to break up an audio file into (say) 10 second slices and ensure words are not cut off. There are a few posts here on Discourse regarding that.

I did have a quick look at “audiogrep” ( https://github.com/antiboredom/audiogrep ) yesterday, but there was an error preventing me from continuing. It doesn’t look like it has been maintained for a while ?

jehoshua · January 23, 2018, 1:06am

I used Audacity recently to remove some noise in a WAV file. The audio files that we need to process here contain considerable gaps, as the speaker is pausing/waiting. It would take a while to go through the audio manually and remove those gaps. Are there any tools that can process an audio file and remove (say) gaps longer than 5 seconds?

yv001 · January 23, 2018, 9:38am

Try the sox tool and its silence effect; a similar issue was resolved in this Stack Overflow topic.
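
If sox is not at hand, a similar idea can be sketched in plain Python. This is only an illustration, assuming 16-bit mono PCM WAV input on a little-endian machine; the function names, the amplitude threshold of 500 and the 0.5 s of pause left in place are arbitrary choices that would need tuning on real recordings.

```python
import array
import wave


def shorten_long_gaps(samples, rate, max_gap_s=5.0, keep_s=0.5,
                      threshold=500, window_s=0.02):
    """Cut every quiet stretch longer than max_gap_s down to keep_s seconds.

    samples: sequence of 16-bit mono sample values.
    threshold: peak amplitude below which a window counts as silence
               (an assumed value, not a recommendation).
    """
    win = max(1, int(rate * window_s))
    max_gap = int(rate * max_gap_s)
    keep = int(rate * keep_s)
    out = array.array("h")
    gap = array.array("h")  # quiet samples seen since the last loud window

    def flush_gap():
        # Short pauses are kept whole; long ones are trimmed to `keep` samples.
        out.extend(gap if len(gap) <= max_gap else gap[:keep])
        del gap[:]

    for start in range(0, len(samples), win):
        chunk = samples[start:start + win]
        if max(abs(s) for s in chunk) < threshold:
            gap.extend(chunk)
        else:
            flush_gap()
            out.extend(chunk)
    flush_gap()
    return out


def shorten_gaps_in_wav(src, dst, **kwargs):
    """Apply shorten_long_gaps to a 16-bit mono WAV file."""
    with wave.open(src, "rb") as w:
        if w.getsampwidth() != 2 or w.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = w.getframerate()
        # WAV data is little-endian; array uses the host byte order.
        samples = array.array("h", w.readframes(w.getnframes()))
    result = shorten_long_gaps(samples, rate, **kwargs)
    with wave.open(dst, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(result.tobytes())
```

A fixed amplitude threshold is crude next to a real VAD, so the output still deserves a listen before it goes into a training set.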

elpimous_robot · January 23, 2018, 6:10pm

Add-on in the first post. Hope it helps!

arkhalid1 · January 23, 2018, 9:32pm

Hi @elpimous_robot
Great tutorial and discussion. I am trying to train on 5000 utterances and it is taking a couple of hours per epoch. Can you share what configuration you used and how long each epoch took? Thanks for the help.

mansurul1985 · January 25, 2018, 2:32am

Hi, great tutorial… May I know what your French alphabet.txt looks like? Thanks

elpimous_robot · January 25, 2018, 8:14pm

Hi, Mansurul1985.

Yes, of course, but keep in mind that it’s for a robot AI, so it is a simplified one, with some tricks to limit bad inference results.

alphabet.txt:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
# FOR FRENCH LIMITED CORPUS - JUST WORKING IN SOUND PERCEPTION - A BOT WILL ANALYSE RESULTS
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
'
é
è
ç
-
# The last (non-comment) line needs to end with a newline.
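
For illustration, a file in this format could be turned into a symbol-to-label table roughly like this. It is a simplified sketch, not DeepSpeech’s own loader; the function name is hypothetical, and skipping empty lines and repeated symbols is an assumption. Note that the line holding a single space is a real label, so lines must not be stripped of whitespace.

```python
def load_alphabet(lines):
    """Map each non-comment line to a numeric label, in file order."""
    label_of = {}
    for raw in lines:
        line = raw.rstrip("\n")  # remove only the newline; a lone space is a label
        if line.startswith("\\#"):
            symbol = "#"         # escaped '#', used as a label
        elif line.startswith("#") or line == "":
            continue             # comment line, or empty line (assumed skippable)
        else:
            symbol = line
        if symbol not in label_of:
            label_of[symbol] = len(label_of)
    return label_of


# Usage with a few lines in the same format as the file above.
sample = ["# a comment\n", " \n", "a\n", "b\n", "'\n", "é\n"]
labels = load_alphabet(sample)
print(labels)
```

With a real file, the lines could come from open("alphabet.txt", encoding="utf-8").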

elpimous_robot · January 25, 2018, 8:24pm

Hello, Phanthanhlong7695.

Did you compile the KenLM utilities? (They are needed if you want to do more than use the existing model.)
http://kheafield.com/code/kenlm/estimation/
Compilation will give you the binaries you want.
Hope this helps.

phanthanhlong7695 · January 26, 2018, 3:12am

It’s done. I fixed this.

phanthanhlong7695 · January 26, 2018, 3:17am

And how can I create the trie file?

jehoshua · January 28, 2018, 4:46am

I’ve been able to use a Python tool to cut a WAV into word chunks - Longer audio files with Deep Speech

The audio outputs range from 1 second to 49 seconds. How will the longer (than 3 to 5 seconds) audio lengths affect the building of a model ?