TUTORIAL : How I trained a specific french model to control my robot

elpimous_robot · January 22, 2018, 8:35am

Hi Jehoshua.
Here is an easy answer:
Do a test:
Record 2 words, with the same tone and duration.
Open both files in Audacity and zoom in on them.
Your eyes will detect variations.
And that is only your voice…
Our environment is also really noisy.
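
The same test can be done with numbers instead of eyes. A minimal sketch, assuming both recordings are 16-bit mono WAV files; the helper names and the file names in the usage line are hypothetical, not part of the tutorial.

```python
import struct
import wave


def read_samples(path):
    """Return the samples of a 16-bit mono WAV file as a list of ints."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono audio")
        raw = wav.readframes(wav.getnframes())
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))


def mean_abs_difference(a, b):
    """Average absolute sample difference over the overlapping part."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a[:n], b[:n])) / n
```

Usage: `mean_abs_difference(read_samples("take1.wav"), read_samples("take2.wav"))`. Any value above zero means the two recordings differ sample by sample, even when they sound alike.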

Keep in mind that your computer is a bit silly: for it, any variation means something different.

The more sounds per character, the easier it is for the silly PC to recognize them…

Also, logical sentences are essential for building the trie, which helps DeepSpeech produce a good inference.
Hope this helps.

Oh, I forgot a part of your question: record the sentences in different ways.
I’ll update the tutorial this afternoon.

mark2 · January 22, 2018, 12:53pm

Publicly available audio corpora usually come in files much longer than 3-5 seconds. If I am training my own model, will DeepSpeech learn from files that are, say, 10-15 minutes long?

Of course I can split those big files into shorter ones using some voice activity detection tool, but those tools are not perfect… so as a result I might get sentences split in the middle, and in any case it requires much more manual work, i.e. adjusting audio files and transcripts, etc…

elpimous_robot · January 22, 2018, 1:25pm

Hi Mark2.

I had the same idea, but…

Because silence durations vary, I think errors can easily happen.

So imagine that a gap appears between a sound and the character it belongs to…
This would build a very bad model…

I think the best way is specificity: inferences are done with WAVs of about 5s, so training files of around 5-10s max seem to be the correct way (not too much risk of gaps in learning).

The best way to cut silences is VAD plus human control… good luck.

Have a look at Kaldi; they work on word separation…

Or ask the DeepSpeech team about the process they used to create this perfect model.

jehoshua · January 23, 2018, 1:00am

Thanks. Yes, in a test with Audacity the differences were quite recognisable. I will have to look into how to break up an audio file into (say) 10 second slices and ensure words are not cut off. There are a few posts here on Discourse regarding that.

I did have a quick look at “audiogrep” ( https://github.com/antiboredom/audiogrep ) yesterday, but there was an error preventing me from continuing. It doesn’t look like it has been maintained for a while ?

jehoshua · January 23, 2018, 1:06am

I used Audacity recently to remove some noise in a WAV file. Considering the audios that we need to process here, there would be considerable gaps in the audio, as the speaker is pausing/waiting. It would take a while to manually go through the audio and remove those gaps. Are there any tools that can process an audio and remove (say) gaps longer than 5 seconds ?
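
To illustrate what such a tool has to do, here is a minimal sketch that shortens long pauses using a plain amplitude threshold. It is an assumption-laden stand-in, not a real voice activity detector: the function name, the `threshold` value and the 20 ms frame size are all hypothetical choices that would need tuning on real audio.

```python
def shorten_silences(samples, sample_rate, max_gap_s=5.0, keep_s=0.5,
                     threshold=500, frame_ms=20):
    """Cut every quiet stretch longer than max_gap_s down to keep_s seconds.

    samples is a list of 16-bit ints. A frame counts as quiet when its
    peak absolute amplitude is below threshold (an arbitrary value).
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_gap = int(max_gap_s * sample_rate)
    keep = int(keep_s * sample_rate)
    out, gap = [], []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) < threshold:
            gap.extend(frame)  # still inside a quiet stretch
            continue
        # Sound resumes: flush the pending gap, trimmed if it was too long.
        out.extend(gap[:keep] if len(gap) > max_gap else gap)
        gap = []
        out.extend(frame)
    # Handle a quiet stretch that runs to the end of the audio.
    out.extend(gap[:keep] if len(gap) > max_gap else gap)
    return out
```

Pauses shorter than `max_gap_s` are left untouched, so natural breaks between words survive. Background noise louder than `threshold` will be treated as speech, which is why a human check of the result is still worthwhile.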

yv001 · January 23, 2018, 9:36am

Try the sox tool and its silence effect; a similar issue was resolved in this Stack Overflow topic.

elpimous_robot · January 23, 2018, 6:10pm

Add-on in the first post. Hope it helps!

arkhalid1 · January 23, 2018, 9:32pm

Hi @elpimous_robot
Great tutorial and discussion. I am trying to train on 5000 utterances and it is taking a couple of hours per epoch. Can you share what configuration you used and how long each epoch took? Thanks for the help.

mansurul1985 · January 25, 2018, 2:32am

Hi, great tutorial… May I know what your French alphabet.txt looks like? Thanks

elpimous_robot · January 25, 2018, 8:11pm

Hi, Mansurul1985.

Yes, of course, but keep in mind that it’s for a robot AI (so it is a simplified one, with some tricks to limit bad inference results).

alphabet.txt :

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
# FOR FRENCH LIMITED CORPUS - JUST WORKING IN SOUND PERCEPTION - A BOT WILL ANALYSE RESULTS
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
'
é
è
ç
-
# The last (non-comment) line needs to end with a newline.
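
For readers who want to see how a file like this turns into numeric labels, here is a minimal sketch that follows the rules stated in the file's own comments (`#` starts a comment, `\#` is an escaped `#` label). The function name `load_alphabet` is hypothetical; this is not DeepSpeech's own loader.

```python
def load_alphabet(lines):
    """Map each label line to a numeric index, in file order."""
    labels = []
    for line in lines:
        if line.startswith("\\#"):
            char = "#"  # escaped '#' used as a label
        elif line.startswith("#"):
            continue  # comment line
        else:
            char = line.rstrip("\r\n")  # a lone space stays a valid label
            if char == "":
                continue  # skip completely empty lines
        labels.append(char)
    return {char: index for index, char in enumerate(labels)}
```

Usage: `with open("alphabet.txt", encoding="utf-8") as f: labels = load_alphabet(f)`. Only the line ending is stripped, so a line holding a single space is kept as the label for the space character.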

elpimous_robot · January 25, 2018, 8:24pm

Hello, Phanthanhlong7695.

Did you compile the kenlm utils? (They are needed if you want to do more than use the existing model.)
http://kheafield.com/code/kenlm/estimation/
Compilation will give you the binaries you want.
Hope this helps.

phanthanhlong7695 · January 26, 2018, 3:08am

It’s done. I fixed this.

phanthanhlong7695 · January 26, 2018, 3:13am

And how can I create the trie file?

jehoshua · January 28, 2018, 4:46am

I’ve been able to use a Python tool to cut a WAV into word chunks; see “Longer audio files with Deep Speech”, post #7 by jehoshua.

The audio outputs range from 1 second to 49 seconds. How will the longer (than 3 to 5 seconds) audio lengths affect the building of a model ?

elpimous_robot · January 28, 2018, 9:11am

Hi.
Well, as you can see in the DeepSpeech process,
a WAV is cut into slices a few milliseconds long.
Each slice of the audio is “linked” to a character of a vocabulary word, and both are sent to the “builder”.

There is a big risk of error in this process, because even a really small gap could result in lots of errors. (A big gap means character errors…)

So a small WAV file, around 5s, is the “best” compromise.

You could think: “So I’ll use WAVs of 1 word only, to avoid gaps.”

It’s not a good idea: a word spoken at the start and the same word spoken after a previous one don’t begin with the same waveform (amplitude).
Ex: “hello”, “I say hello”
Often, the beginning of the waveform is higher when the word comes first.

Don’t hesitate to share your tests with us.
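
To get a feel for the numbers, here is a rough sketch of how many audio slices a clip yields and how many fall on each transcript character on average. The 25 ms window and 10 ms step are assumed values (common defaults for speech features), not figures taken from this thread, and both function names are hypothetical.

```python
def frame_count(duration_s, window_ms=25, step_ms=10):
    """Number of full analysis windows that fit in a clip."""
    duration_ms = int(round(duration_s * 1000))
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // step_ms


def frames_per_character(duration_s, transcript, **kwargs):
    """Average number of slices available for each transcript character."""
    if not transcript:
        raise ValueError("empty transcript")
    return frame_count(duration_s, **kwargs) / len(transcript)
```

With these assumed settings, `frame_count(5.0)` gives 498 slices for a 5 second clip. A very long clip spreads thousands of slices over a long transcript, which leaves more room for the slices and characters to drift apart.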

elpimous_robot · January 29, 2018, 3:37pm

Please be more explicit because I don’t understand your question.

jehoshua · January 30, 2018, 2:32am

Yes, and I appreciate that your thread here is based on building the WAV files used for training by speaking. However, that is not always the case, as sometimes we may want to do the ‘same’ type of building (i.e. build our own models), but the audio all comes from one existing WAV file. Hence the need to cut a WAV file into small pieces while trying to keep whole words within each cut.

That is, no broken words.

As you say, a small WAV with a maximum duration of 5 seconds is ideal. I have been testing py-webrtcvad, the “Python interface to the WebRTC Voice Activity Detector” (GitHub: wiseman/py-webrtcvad).

There is a python script there, example.py and I ran it against a 10 minute WAV file. The results were 56 WAV files, duration range from 00.63 seconds to 49.38 seconds.

Then the author of that package advised how to cut down the range duration size, as 49.38 seconds is a long way from your recommendation of 5 seconds max. The results then were 243 WAV files, duration range from 00.18 seconds to 13.44 seconds.

Of course some of those shorter WAVs are just noise, or even silence, at least as far as I could hear. Some WAVs with 2 or 3 words were only 2 seconds long, and quite a few contain just 1 word.

Of those 243 WAV files, there are only 31 that exceed your recommendation of 5 seconds though, so that seems encouraging.
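
Counting by hand gets tedious, so here is a minimal sketch that sorts clips by duration using only the standard library. The 5 second ceiling comes from this thread; the 0.5 second floor and the function names are hypothetical choices for illustration.

```python
import wave


def wav_duration(path):
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def split_by_duration(paths, max_s=5.0, min_s=0.5):
    """Sort clips into too_short, ok and too_long by duration."""
    buckets = {"too_short": [], "ok": [], "too_long": []}
    for path in paths:
        seconds = wav_duration(path)
        if seconds < min_s:
            buckets["too_short"].append(path)
        elif seconds > max_s:
            buckets["too_long"].append(path)
        else:
            buckets["ok"].append(path)
    return buckets
```

The `too_short` bucket is a convenient list of candidates to listen to, since very short chunks are often the noise-only ones.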

elpimous_robot · January 30, 2018, 6:28am

Very good, Jehoshua.
Train it and tell us about the WER…

mark2 · February 12, 2018, 12:29pm

After running the command:

/bin/bin/./build_binary -T -s words.arpa lm.binary

I get for some (but not all) vocabularies the following error:

vocab.cc:305 in void lm::ngram::MissingSentenceMarker(const lm::ngram::Config&, const char*) threw SpecialWordMissingException.
The ARPA file is missing </s> and the model is configured to reject these models. Run build_binary -s to disable this check. Byte: 106432571
ERROR

Do you know what causes it?
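
A quick way to confirm what the message reports is to scan the unigram section of the ARPA file for the sentence markers. This is a minimal sketch assuming the usual ARPA text layout (a `\1-grams:` header followed by lines of log probability, word and optional backoff); the function name is hypothetical and it does not explain why the marker is absent.

```python
def missing_sentence_markers(arpa_lines):
    """Return which of <s> and </s> are absent from the unigram section."""
    wanted = {"<s>", "</s>"}
    in_unigrams = False
    for line in arpa_lines:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if line.startswith("\\"):
                break  # next section (for example \2-grams:) begins
            fields = line.split()
            if len(fields) >= 2:
                wanted.discard(fields[1])  # second field is the word
    return sorted(wanted)
```

Usage: `with open("words.arpa", encoding="utf-8") as f: print(missing_sentence_markers(f))`. An empty list means both markers are present as unigrams.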

elpimous_robot · February 12, 2018, 7:56pm

Hi Mark2.
I think you should ask Kenneth, the creator of the kenlm tools:
http://kheafield.com/code/kenlm
It’s an LM problem related to silences.
I saw issues about it on its GitHub, if I remember correctly!

Did you add silences to your “file”.txt before converting it to ARPA?
I didn’t!
I just added one sentence per line, without punctuation.
I didn’t have any problems.
Good luck
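
The "one sentence per line, without punctuation" preparation can be sketched as below. This is an illustration under assumptions, not the script used for the tutorial: the `ALLOWED` set mirrors the French alphabet.txt posted earlier in this thread, and the function names are hypothetical.

```python
import re

# Characters kept, mirroring the French alphabet.txt in this thread.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz'éèç- ")


def clean_sentence(text, allowed=ALLOWED):
    """Lowercase a sentence and blank out characters outside the alphabet."""
    text = text.lower().replace("’", "'")
    kept = "".join(ch if ch in allowed else " " for ch in text)
    return re.sub(r" +", " ", kept).strip()


def build_corpus(sentences):
    """One cleaned, non-empty sentence per line."""
    lines = [clean_sentence(s) for s in sentences]
    return "\n".join(line for line in lines if line) + "\n"
```

Usage: `build_corpus(["Bonjour, robot !", "Avance d’un mètre."])` returns two lines with no punctuation. Letters missing from the alphabet (à, ù and so on) are blanked out by this sketch, so they would need to be mapped to an allowed character first if the corpus contains them.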
