I have a question for you. You use sentences to train your model so that it can be used with commands. Do these sentences concern only your commands?
Let me explain:
I want a robot that understands 10 orders.
Do I have to train my model only with sentences containing the words used in these orders?
Example: the order “What time is it?”
What kind of sentences do I have to have in my wav files?
Last question: do single-word orders mean single-word sentences for training?
Thanks again for this tutorial, I hope my question is not too unclear.
Well, you need 10 orders… Only?
With my robot I’m working on a social approach, so I make it learn the same order asked in different forms…
Ex: what time is it - what is the time - could you tell me the time…
For single words like stop - yes - no…
I’d suggest you record only the word.
The LM works with probabilities.
If “stop” is learnt inside a whole sentence (ex: I want you to stop now),
the model could perhaps interpret a noise as a possible word near the learnt “stop” (ex: to stop, or stop now).
A simple thing to keep in mind:
the more possibilities, the greater the risk of errors at inference.
What I would do:
For a few limited orders,
Record the same sentences, max 5 s
Vary intonations
Vary the environment noise (rooms, walls, front, back… up, down…)
Vary the recording location (echoes are difficult to hear in recordings, but heard by the robot).
For robot use, record the robot’s noise too (if it has wheels, record at different speeds… same for Dynamixels…)…
To avoid overfitting in training, record roughly 50 to 100 sentences per order minimum.
Use only your few limited orders for the LM and trie build.
You should obtain a small model with good results…
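To give the idea, here is a rough sketch of the small corpus file I mean (the order list is just an example; you would then feed the file to the KenLM and trie build steps shown in the tutorial):

```python
# Sketch: build the tiny text corpus used for the limited LM/trie.
# Every line is one of your limited orders, nothing else.
orders = [
    "what time is it",
    "what is the time",
    "could you tell me the time",
    "stop",
    "yes",
    "no",
]

with open("vocabulary.txt", "w") as f:
    for order in orders:
        f.write(order + "\n")

# The resulting vocabulary stays tiny, which limits inference errors.
words = sorted({w for order in orders for w in order.split()})
print(len(words))  # 12 distinct words in this example
```

The point is that the LM only ever sees these few words, so it cannot wander off to the rest of the language.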
Hope you have a good robot microphone.
(Mine is a ReSpeaker 4-mic array.)
Thanks for your quick reply! Lots of useful stuff in there.
The final goal is a robot (don’t know what type yet, still under discussion), so for now it’ll be an app on a tablet. 100% of the tablet’s resources will be for this app.
For now we only need 10-20 orders, yeah. Such as “Register”: the tablet knows that “register” means “register the next 10 min of your CPU usage”, for example.
I thought of using a combination of my command words and what I call phoneme-words: a group of words where each one represents a particular phoneme of the French language, like [é] or [p]. The goal is for my model to know all the phonemes and give quality recognition. But what you said:
makes me doubt this method.
If I get it right, you mean recording “Register” with a lot of variation such as noise, radio noise, emotion, distance, voice intensity…
Does that mean my LM is a unigram?
I hadn’t thought of microphone quality, as I started using Audacity and editing my recordings in it (while respecting the required characteristics: mono, 16 kHz and 16 bits), but I will.
I forgot to say: it’s a multi-speaker app. So can I count the 50-100 sentence variations per order across all speakers, or does each speaker count as another variation? I don’t want my recognizer to be speaker-locked…
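By the way, here is the little check I wrote to make sure a record respects those characteristics (just a sketch with Python’s standard `wave` module; the silent demo file is only there to make it runnable):

```python
import wave

def check_wav(path):
    """True if the file is mono, 16 kHz, 16-bit, as DeepSpeech expects."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1 and
                w.getframerate() == 16000 and
                w.getsampwidth() == 2)  # 2 bytes per sample = 16 bits

# Demo: write one second of silence in the right format, then check it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setframerate(16000)
    w.setsampwidth(2)
    w.writeframes(b"\x00\x00" * 16000)

print(check_wav("demo.wav"))  # True
```

Running it over a whole dataset folder before training catches any file Audacity exported with the wrong settings.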
Thanks again, your help is very useful! It’s a vast subject, and reading your tutorial and responses helps me head in the right direction.
Hi, you have done a great job here. I have learned a lot of things from this thread, it’s a perfect tutorial especially for a beginner like me!
My model is working, so I decided to extend my dataset using data augmentation techniques. Specifically, I want to increase the speech speed, but when I call the voice-corpus-tool help command I can’t find the “speed” parameter you mentioned. I would appreciate it if you could give an example of the command you used in your case.
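To be clear about what I mean by “speed”, here is a naive sketch of the effect I am after (my own assumption of how it works, not a voice-corpus-tool command: resample so the clip plays `factor` times faster, pitch shift included, like sox’s `speed` effect):

```python
# Naive speed augmentation sketch on a list of int16 samples.
# For real data you would read/write the samples with the wave module.

def speed_up(samples, factor):
    """Resample `samples` so playback is `factor` times faster."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor          # position in the original signal
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(round(a + (b - a) * frac)))  # linear interpolation
    return out

fast = speed_up(list(range(18)), 1.8)
print(len(fast))  # 10 samples instead of 18: the clip is 1.8x shorter
```

If the tool has no such parameter, an external pass with sox over the wav files would do the same job.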
Could you tell us more about your use case, your data, and the different steps and difficulties you encountered? It might help people if you share this information.
Thanks !
Hey @caucheteux,
Firstly, I have been trying to fine-tune the English model (v0.5.1) with Common Voice data. Later I plan to build my own Greek model; I have already given it a try, and I saw that by following this tutorial’s steps I can build it successfully. However, training might bring difficulties. For now I am working on the first issue (fine-tuning); you can find more details about my parameters and my difficulties in my last post https://discourse.mozilla.org/t/fine-tuning-deepspeech-model-commonvoice-data/41872/8.
The main problem was finding an appropriate learning rate. With large values (1e-3 / 1e-4) the training & validation loss reach infinity. Finally, I should mention that my main purpose is decoding long files (e.g. a 10-minute clip of BBC News). Right now I have a WER of about 50%, and I decided to extend the training set (CV data) by increasing the speed of existing files and adding them to the training data.
Can anyone help me?
Thank you !
I remember now! I saw your problem a few days ago; it is very interesting, as you deal with subjects like long recordings, fine-tuning, and a new language model.
You mean, going from a sentence in a 10 s file to the same sentence in a 5 s file, for example? What’s the goal? Adapting your model to people who speak quickly? Or is it just to add a variation and enrich your dataset?
That is exactly my goal: I want to boost the model’s performance on quick speech, so I decided to take the same training samples, increase their speed, and add them to my training set.
Also, I would like to hear your opinion on this idea,
i.e. is there any chance of boosting performance using 120,000 extra samples (at speed 1.8)?
That’s about 1/5 of my CV training dataset, ~150 hours…
Thank you @elpimous_robot for the response and your advice. Could you give me the command you used with voice-corpus-tool? I can’t find how to set the speed parameter… At this point I will experiment with increasing the speed of CV data (mono speaker, short sentences < 10 s) and test on long BBC News clips (10 minutes) to check whether it outperforms my previous model (WER 50%).
hey @elpimous_robot,
thanks again for your response, always helpful
Yeah, I’m pretty stuck… haha. It would be easier if my problem were the same as yours.
So if I get this right, you mean:
for each speaker:
for each word/order, 100-200 recordings
So with 10 one-word orders and 5 speakers: ~5,000-10,000 recordings?
And one last question: will this model only work for the 5 recorded speakers, or also for speakers never recorded? I mean, do I have to record every new speaker who will be using the app and retrain the model with their recordings?
That’s my concern, as it might be a problem to need 1 h of recordings for each new speaker and to retrain the model.
My idea is that if I get enough speaker variety, I can get acceptable versatility for my model and therefore don’t need to retrain it every time I have a new user.
Yes! The issue is finding a compromise between the time (and resources) to create the dataset and the quality of the model trained with it…
I’m thinking of reducing the number of recordings per speaker per word/order but increasing the number of speakers. Does that seem like a good way to reduce the time needed per speaker without significantly decreasing the quality of my model?
Maybe @lissyx or @reuben can help if they have any ideas?
You need an English model, right?
If yes, why not use the whole model and create an LM containing only the words you need? @reuben, I think it could work, no?
Dear all, thanks for these amazing instructions. I have prepared my own voice files to train on specific-domain data for the Bangla language. I created all the files according to the instructions, but when I run the .sh file, after 1-2 hours of training it produces an error:
" Fatal Python error: Segmentation fault
Thread 0x00007f3a569bd700 (most recent call first):
File “/usr/lib64/python3.6/threading.py”, line 295 in wait
File “/usr/lib64/python3.6/queue.py”, line 164 in get
File “/home/venvs/projectSTT/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_writer.py”, line 159 in run
File “/usr/lib64/python3.6/threading.py”, line 916 in _bootstrap_inner
File “/usr/lib64/python3.6/threading.py”, line 884 in _bootstrap"
Can anyone help me successfully train my model?