I have a question for you. You use sentences to train your model so that it can be used with commands. Do these sentences concern only your commands?
Let me explain:
I want a robot that understands 10 orders.
Do I have to train my model only with sentences containing the words used in these orders?
Example: the order “What time is it?”
What kind of sentences do I have to have in my wav files?
Last question: do single-word orders mean single-word sentences for training?
Thanks again for this tutorial, I hope my question is not too unclear.
Well, you need 10 orders… Only?
With my robot I’m working on a social approach, so I make it learn the same order asked in different forms…
Ex: what time is it - what is the time - could you tell me the time…
For single words like stop - yes - no…
I’d suggest you record only the word.
The LM works with probabilities.
If “stop” is learnt inside a whole sentence (ex: I want you to stop now),
the model could perhaps interpret a noise as a possible word near the learnt “stop” (ex: to stop, or stop now).
A simple thing to keep in mind:
the more possibilities, the greater the risk of errors at inference.
What I would do:
For a few limited orders,
Record the same sentences, max 5 s
Vary intonations
Vary the environment noise (rooms, walls, front, back… up, down…)
Vary the recording location (echoes are difficult to hear in recordings, but heard by the robot).
For robot use, record the robot’s noise too (if it has wheels, record at different speeds… same for Dynamixels…)…
To avoid overfitting in training, record roughly 50 to 100 sentences per order minimum.
Use only your few limited orders for the LM and trie build.
You should obtain a small model with good results…
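To give the idea, here is a rough sketch of the small corpus file I mean (the order list is just an example; you would then feed the file to the KenLM and trie build steps shown in the tutorial):

```python
# Sketch: build the tiny text corpus used for the limited LM/trie.
# Every line is one of your limited orders, nothing else.
orders = [
    "what time is it",
    "what is the time",
    "could you tell me the time",
    "stop",
    "yes",
    "no",
]

with open("vocabulary.txt", "w") as f:
    for order in orders:
        f.write(order + "\n")

# The resulting vocabulary stays tiny, which limits inference errors.
words = sorted({w for order in orders for w in order.split()})
print(len(words))  # 12 distinct words in this example
```

The point is that the LM only ever sees these few words, so it cannot wander off to the rest of the language.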
Hope you have a good robot microphone.
(Mine is a ReSpeaker 4-mic array.)
Thanks for your quick reply! Lots of useful stuff in there.
The final goal is a robot (don’t know what type yet, still under discussion), so for now it’ll be an app on a tablet. 100% of the tablet’s resources will be for this app.
For now we only need 10-20 orders, yeah. Such as “Register”: the tablet knows that “register” means “register the next 10 min of your CPU usage”, for example.
I thought of using a combination of my command words and what I call phoneme-words: a group of words where each one represents a particular phoneme of the French language, like [é] or [p]. The goal is for my model to know all the phonemes and give quality recognition. But what you said:
makes me doubt this method.
If I get it right, you mean recording “Register” with a lot of variation such as noise, radio noise, emotion, distance, voice intensity…
Does that mean my LM is a unigram?
I hadn’t thought of microphone quality, as I started using Audacity and editing my recordings in it (while respecting the required characteristics: mono, 16 kHz and 16 bits), but I will.
I forgot to say: it’s a multi-speaker app. So can I count the 50-100 sentence variations per order across all speakers, or does each speaker count as another variation? I don’t want my recognizer to be speaker-locked…
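By the way, here is the little check I wrote to make sure a record respects those characteristics (just a sketch with Python’s standard `wave` module; the silent demo file is only there to make it runnable):

```python
import wave

def check_wav(path):
    """True if the file is mono, 16 kHz, 16-bit, as DeepSpeech expects."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1 and
                w.getframerate() == 16000 and
                w.getsampwidth() == 2)  # 2 bytes per sample = 16 bits

# Demo: write one second of silence in the right format, then check it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setframerate(16000)
    w.setsampwidth(2)
    w.writeframes(b"\x00\x00" * 16000)

print(check_wav("demo.wav"))  # True
```

Running it over a whole dataset folder before training catches any file Audacity exported with the wrong settings.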
Thanks again, your help is very useful! It’s a vast subject, and reading your tutorial and responses helps me head in the right direction.
Hi, you have done a great job here. I have learned a lot of things from this thread, it’s a perfect tutorial especially for a beginner like me!
My model is working, so I decided to extend my dataset using data augmentation techniques. Specifically, I want to increase the speech speed, but when I call the voice-corpus-tool help command I can’t find the “speed” parameter you mentioned. I would appreciate it if you could give an example of the command you used in your case.
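To be clear about what I mean by “speed”, here is a naive sketch of the effect I am after (my own assumption of how it works, not a voice-corpus-tool command: resample so the clip plays `factor` times faster, pitch shift included, like sox’s `speed` effect):

```python
# Naive speed augmentation sketch on a list of int16 samples.
# For real data you would read/write the samples with the wave module.

def speed_up(samples, factor):
    """Resample `samples` so playback is `factor` times faster."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor          # position in the original signal
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(round(a + (b - a) * frac)))  # linear interpolation
    return out

fast = speed_up(list(range(18)), 1.8)
print(len(fast))  # 10 samples instead of 18: the clip is 1.8x shorter
```

If the tool has no such parameter, an external pass with sox over the wav files would do the same job.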
Could you tell us more about your use case, your data, and the different steps and difficulties you encountered? It might help people if you share this information.
Thanks !
Hey @caucheteux,
Firstly, I have been trying to fine-tune the English model (v0.5.1) with Common Voice data. Later I plan to build my own Greek model; I have already given it a try, and I saw that by following this tutorial’s steps I can build it successfully. However, training might bring difficulties. For now I am working on the first issue (fine-tuning); you can find more details about my parameters and my difficulties in my last post https://discourse.mozilla.org/t/fine-tuning-deepspeech-model-commonvoice-data/41872/8.
The main problem was finding an appropriate learning rate. With large values (1e-3 / 1e-4) the training & validation loss reach infinity. Finally, I should mention that my main purpose is decoding long files (e.g. a 10-minute clip of BBC News). Right now I have a WER of about 50%, and I decided to extend the training set (CV data) by increasing the speed of existing files and adding them to the training data.
Can anyone help me?
Thank you !
I remember now! I saw your problem a few days ago; it is very interesting, as you deal with subjects like long recordings, fine-tuning, and a new language model.
You mean, going from a sentence in a 10 s file to the same sentence in a 5 s file, for example? What’s the goal? Adapting your model to people who speak quickly? Or is it just to add a variation and enrich your dataset?
That is exactly my goal: I want to boost the model’s performance on quick speech, so I decided to take the same training samples, increase their speed, and add them to my training set.
Also, I would like to hear your opinion on this idea,
i.e. is there any chance of boosting performance using 120,000 extra samples (at speed 1.8)?
That’s about 1/5 of my CV training dataset, ~150 hours…
Thank you @elpimous_robot for the response and your advice. Could you give me the command you used with voice-corpus-tool? I can’t find how to set the speed parameter… At this point I will experiment with increasing the speed of CV data (mono speaker, short sentences < 10 s) and test on long BBC News clips (10 minutes) to check whether it outperforms my previous model (WER 50%).
hey @elpimous_robot,
thanks again for your response, always helpful
Yeah, I’m pretty stuck… haha. It would be easier if my problem were the same as yours.
So if I get this right, you mean:
for each speaker:
for each word/order, 100-200 recordings
So with 10 one-word orders and 5 speakers: ~5,000-10,000 recordings?
And one last question: will this model only work for the 5 recorded speakers, or also for speakers never recorded? I mean, do I have to record every new speaker who will be using the app and retrain the model with their recordings?
That’s my concern, as it might be a problem to need 1 h of recordings for each new speaker and to retrain the model.
My idea is that if I get enough speaker variety, I can get acceptable versatility for my model and therefore don’t need to retrain it every time I have a new user.
Yes! The issue is finding a compromise between the time (and resources) to create the dataset and the quality of the model trained with it…
I’m thinking of reducing the number of recordings per speaker per word/order but increasing the number of speakers. Does that seem like a good way to reduce the time needed per speaker without significantly decreasing the quality of my model?
Maybe @lissyx or @reuben can help if they have any ideas?
You need an English model, right?
If yes, why not use the whole model and create an LM containing only the words you need? @reuben, I think it could work, no?
Dear all, thanks for these amazing instructions. I have prepared my own voice files to train on specific-domain data for the Bangla language. I created all the files according to the instructions, but when I run the .sh file, after 1-2 hours of training it produces an error:
" Fatal Python error: Segmentation fault
Thread 0x00007f3a569bd700 (most recent call first):
File “/usr/lib64/python3.6/threading.py”, line 295 in wait
File “/usr/lib64/python3.6/queue.py”, line 164 in get
File “/home/venvs/projectSTT/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_writer.py”, line 159 in run
File “/usr/lib64/python3.6/threading.py”, line 916 in _bootstrap_inner
File “/usr/lib64/python3.6/threading.py”, line 884 in _bootstrap"
Can anyone help me successfully train my model?