Hi,
Any chance that you’ll release the created model?
Would it work (for usage, not creation) on a low-end computer such as a Raspberry Pi?
It would be nice to use it in French with openjarvis.com, for example.
Hi, tbozo,
Well, my model wouldn’t help you, and here is why:
- the model is limited to my own voice, so it wouldn’t recognize you at all!
- the model is strictly limited to the questions my bot expects.
Yes, it should work on an RPi2/3 (but ask Gerard-majax about that).
But I plan to create a multi-speaker French model, with the help of VoxForge.
That one would suit you!
This model will run on one of my next robots, the QBO1 (CORPORA), based on Arduino/RPi3.
Now, about Openjarvis: my AI is based on Rivescript-python (very, very powerful, you should try it!).
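If you want a quick taste of it, a minimal Rivescript-python bot looks roughly like this (just a sketch: the “brain” folder and the test message are placeholders, not my actual bot):

from rivescript import RiveScript  # pip install rivescript

bot = RiveScript()
bot.load_directory("./brain")   # folder containing your *.rive reply files
bot.sort_replies()              # required after loading, before asking for replies
print(bot.reply("localuser", "hello robot"))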
Hope this helped you.
It’s going to depend highly on your expectations, but as I recall, @elpimous_robot’s model was small enough that it ran not that badly on an RPi2/3. I would not expect realtime, though, but it might be manageable in your case.
If you need some other voices for your model, I might be able to help you.
Openjarvis is nicely packaged and easy to use (at least if you use Raspbian Jessie and not Stretch). I use Snowboy for offline hotword detection and Bing otherwise. It works quite well in French (some problems with my 7-year-old child).
As for the interactions, it seems that the syntax is comparable, at least for basic tasks. There is a plugin for Rivescript called Rivescript bot.
Unfortunately I can’t afford a time-consuming project right now…
I’m a C++/Python guy at work, but at home I go for the easiest option.
@lissyx I was looking for real time; it seems I might need to change my hardware.
I’ll stay with my Bing API for now…
Hi,
How many French voices do you have? I’m interested!
For now, I only have nearly 5 hours of my own voice (nearly 5,000 training samples…).
I’m working on VoxForge to recover all the French material, but it’s harder than I expected (it will take more time…).
With a standard STT, a child’s voice is hard to recognize due to its different frequency range;
but with deep learning, that restriction goes away.
Send me a private message for French-specific discussion, if you want!
Is it possible to create a new trie/language model (as explained above) using transcripts with more jargon/technical speech, then use those in conjunction with the pre-built Mozilla DeepSpeech output_graph? I don’t have the resources to train a completely new model, but I can certainly generate the language models for my specific (technical) variety of English speakers. I’m just not sure whether using the pre-built output_graph with the new language model and trie will work.
I cannot find a program named “generate_trie”… In my DeepSpeech folder there is a subfolder named native_client, but it only contains generate_trie.cpp. Should I compile it somehow first? Could you give more instructions on how to call generate_trie?
Hi Mark2,
To obtain the generate_trie binary,
I had to compile the native client!
Have a look at the native_client/README.md file,
under “Bazel build…”.
Hi Dj-Hay,
Sure. Creating a new trie file / vocabulary could help you recognize new words/sentences.
Be sure to have a complete sentence per line in your vocab, not just one word!
If you don’t want to do all the setup for building DeepSpeech from source, I’d recommend downloading Mozilla’s pre-built native_client and using the generate_trie command from there - see https://github.com/mozilla/DeepSpeech/tree/master/native_client
Basically running the following command should do the trick.
python util/taskcluster.py --target /path/to/destination/folder
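Once you have the binaries, building the language model and the trie from your vocabulary looks roughly like this (a sketch using KenLM’s lmplz/build_binary; the exact generate_trie arguments have changed between DeepSpeech releases, so check the README or its usage message for your version):

lmplz -o 3 --text vocabulary.txt --arpa words.arpa
build_binary words.arpa lm.binary
./generate_trie alphabet.txt lm.binary vocabulary.txt trie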
I ran the command and called the pre-built generate_trie program. However, I got a “-bash: ./generate-trie: cannot execute binary file” error, although it has execute permission for all. Is it because it was compiled on Linux and I use macOS? Are there any workarounds, or should I compile the program from source?
taskcluster.py downloads the Linux binaries by default. You need to pass --arch osx, as documented.
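So for macOS the download command becomes something like:

python util/taskcluster.py --arch osx --target /path/to/destination/folder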
Great, thanks! I’m curious as to what the neural network is doing, then. Is it generating a bunch of vowel/consonant sound primitives that are fed into the trie/lm.binary? And then the trie/lm.binary decides which words that ordering of vowel/consonant sounds most probably makes?
Ah, never mind. I think the original paper (https://arxiv.org/abs/1408.2873) does show that the DNN part predicts characters from the alphabet. Thus the DNN produces a long sequence of letters/spaces from the given audio. Then that is fed into the language model (which is completely separate from the DNN), which works out which combination of letters and spaces makes the best sentence/words. Please correct me if I’m wrong.
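In other words, as I understand it, the acoustic model emits for every time step a probability distribution over the alphabet plus a CTC “blank” symbol, and a toy greedy decode of that output (just to illustrate the idea; the real decoder is a beam search that also scores hypotheses with lm.binary/trie) would look something like:

import numpy as np

# one row per audio time step, one column per symbol; column 0 is the CTC blank
alphabet = ["<blank>", " ", "a", "c", "t"]
probs = np.array([
    [0.1, 0.0, 0.0, 0.8, 0.1],  # "c"
    [0.1, 0.0, 0.8, 0.0, 0.1],  # "a"
    [0.1, 0.0, 0.8, 0.0, 0.1],  # "a" repeated -> collapsed
    [0.8, 0.1, 0.0, 0.0, 0.1],  # blank
    [0.1, 0.0, 0.0, 0.1, 0.8],  # "t"
])

decoded = []
prev = None
for idx in probs.argmax(axis=1):    # greedy: take the best symbol at each step
    if idx != prev and idx != 0:    # collapse repeats and drop blanks
        decoded.append(alphabet[idx])
    prev = idx
print("".join(decoded))             # -> "cat"; the LM step then picks the best word/sentence hypothesis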
Thanks for your tutorial. I’m currently training a German model using an open-source corpus, and this is a big help!
I was wondering why you use vocabulary.txt instead of alphabet.txt in the --alphabet_config_path parameter for DeepSpeech.py?
Hi. Happy to help you.
Thanks for the question: my fault!
Of course, you should point --alphabet_config_path to alphabet.txt!
vocabulary.txt is only used for the LM/trie…
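So, roughly (the CSV names here are just placeholders, and flag names can differ a little between DeepSpeech versions, so check python DeepSpeech.py --help):

python DeepSpeech.py --train_files train.csv --dev_files dev.csv --test_files test.csv --alphabet_config_path data/alphabet.txt --checkpoint_dir checkpoints/

and vocabulary.txt only goes into the lmplz / generate_trie steps, never into --alphabet_config_path.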
If you see other errors, tell me…
See you.
Yes.
In alphabet.txt, you only have symbols!
Each symbol is a label.
DeepSpeech learns each label from a lot of sounds.
Some other parameters (lm/trie) work hard to evaluate a heard sentence and predict the resulting inference.
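For example, a minimal English alphabet.txt is just one symbol per line: the very first label is a single space character, then the letters, then the apostrophe (lines starting with # are comments):

# one label per line; the first label below is a space
 
a
b
c
…
z
'

A French alphabet simply adds the accented characters (é, è, à, ç, …) as extra lines.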
Thanks for your tutorial. We have hundreds of audio files for just one person/speaker and are considering making a specific model. We were considering breaking up each audio file into single words for training purposes. However, I now see from your comment that complete sentences are preferred.
My thinking with the single-word approach was to significantly reduce the size of the model, since it is for one person/speaker. For example, a 19-second WAV that has 55 words has 33 unique words. Is there any advantage in using the same word from the same speaker for training the model?
I guess my question is: how differently can one person speak one word?
Hi JHOSHUA,
I’ll give you an easy answer:
Do a test:
Record 2 words with the same tone and duration,
open both files in Audacity and zoom in.
Your eyes will detect variations.
And we’re only thinking of your voice…
Our environment is really noisy.
Keep in mind that your computer is a bit silly: for it, variations = different.
The more sounds per character, the easier it is for the silly PC to recognize…
Also, logical sentences are imperative for the trie build, to help DeepSpeech produce a good inference.
Hope this helps.
Oh, I forgot part of your question: record varied sentences.
I’ll update the tutorial this afternoon.