Create a subset of existing models

I was wondering if and how it would be possible to extract a subset of commands to be recognized out of the big trained model. Some of us just need a number of words, utterances or phrases to work with in our projects - limiting the STT engine to just those words would greatly improve speed.

So - is it possible? And how should I proceed? Training a model myself is out of the question, as my project is aimed at a much larger audience.

Thanks.

Michaela


Thanks for creating the topic. I’m still unsure as to why “training a model is out of the question” - can you detail your thinking about that? I guess there’s some misunderstanding underlying this.

Hey Lissyx - the phrases and utterances are variable, and the targeted system is not a personal device but would be made available to a wider audience. The plan is for the individual devices to upload the required phrases (like nurse, help, light, temperature) to our server - we could create a specific model for that device and transfer it into the remote device’s realm. Each individual device might need only 10 or 20 words (phrases), which should speed up the process dramatically.

Does “creating a specific model for that device” involve training? If not, how do you plan to “create a specific model for that device”?

And yet, I’m puzzled, because what you describe seems, to me, like a good first step towards creating your own model.

I think what @mischmerz is trying to do is to target the inference of the released model to only a few words, and is hoping that there is a method to extract only the parts of the acoustic/language models that are relevant to the chosen words (perhaps not enough audio data samples for acoustic model training are available?).

AFAIK, the acoustic model cannot be targeted, but a new language model that incorporates only the chosen words could potentially be used for the decoding part. This does not require audio training data, just some representative text containing the words you need. If people say things outside your language model, they are still likely to be interpreted as language-model words.

You can see an example of how to set up a new language model in this topic: https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830/44

This method won’t speed up the acoustic part, but decoding should be a bit faster.
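As a rough illustration of that language-model route (a sketch only, assuming the KenLM tools lmplz and build_binary are installed as in the linked tutorial; exact steps depend on your DeepSpeech version):

```python
# Sketch: build a tiny KenLM language model from a short list of command
# phrases so the decoder strongly favours those words. Assumes the KenLM
# binaries (lmplz, build_binary) are on PATH, as in the linked tutorial.
import subprocess

phrases = ["nurse", "help", "light on", "light off",
           "temperature up", "temperature down"]

with open("corpus.txt", "w") as f:
    for p in phrases:
        f.write(p + "\n")

# Small n-gram order; --discount_fallback is usually needed on tiny corpora.
subprocess.run(["lmplz", "--order", "2", "--discount_fallback",
                "--text", "corpus.txt", "--arpa", "lm.arpa"], check=True)

# Convert the ARPA file into the binary format the decoder loads.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```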

When I set “utterances” to be recognized in an Alexa environment (linking words to an action to be triggered), it takes about 2 minutes or so until the skill is ready for testing (about 30 small phrases like “door lock”). It may very well be a training process, but it also may be some form of pre-trained model being adapted for use by my skill (the “app”). I’d like to propose or even create something similar: a website where folks can upload some words or short phrases they need for their environment, and the process returns a trained model for that particular task.

I guess my question would be: How to achieve that?

Well, a combination of what has been described above. But at some point, if one of your needs is a smaller model, there’s really no other solution than actually performing a new training. Different graph widths cannot be merged together.

OK - let me be specific: is there any data that would allow me to create a subset of phrases to be recognized and to train a universal model (one that understands a wide audience of speakers) to achieve my goal? Let’s say I want a model that only understands the words “Red”, “Green” and “Blue”. Can I access some raw data containing those words and train a new model? And how long would it (realistically) take on average server hardware to train a 3-word model?

For the data, I guess you could try to process those from Common Voice for example. Training time will depend on your exact hardware (GPU) and the parameters you set (width of the network, etc).
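For example, something along these lines could filter a Common Voice release for clips whose transcripts contain your words (a sketch under assumptions: a validated.tsv with path and sentence columns, which may differ between releases):

```python
# Sketch: pull Common Voice clips whose transcripts contain the target words,
# to assemble a small training set. Assumes a Common Voice release with a
# validated.tsv file containing "path" and "sentence" columns; column names
# and layout can differ between releases.
import pandas as pd

targets = {"red", "green", "blue"}

cv = pd.read_csv("cv-corpus/en/validated.tsv", sep="\t")
sentences = cv["sentence"].fillna("").str.lower()

mask = sentences.str.split().apply(lambda words: bool(targets & set(words)))
subset = cv[mask]

# DeepSpeech expects a CSV with wav_filename, wav_filesize and transcript;
# converting the mp3 clips to wav and filling in the sizes is left out here.
subset[["path", "sentence"]].rename(
    columns={"path": "wav_filename", "sentence": "transcript"}
).to_csv("subset.csv", index=False)
```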

Training the single-audio sample LDC93S1 that we have in the repo, on a model of width n_hidden=494 for 50-80 epochs (overfitting), takes less than a second per epoch. Training completes in 19 seconds for 120 epochs on my single GTX 1080, Core i7 4790K desktop.
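For reference, a run like that looks roughly like this (a sketch only; flag names such as --epochs vs --epoch have changed between DeepSpeech releases, so check the version you are running):

```python
# Sketch of the kind of run described above: overfit the bundled LDC93S1
# sample with a narrow (n_hidden=494) network. Flag names have changed
# between DeepSpeech releases, so treat these as an illustration.
import subprocess

subprocess.run([
    "python", "-u", "DeepSpeech.py",
    "--train_files", "data/ldc93s1/ldc93s1.csv",
    "--dev_files", "data/ldc93s1/ldc93s1.csv",
    "--test_files", "data/ldc93s1/ldc93s1.csv",
    "--train_batch_size", "1",
    "--n_hidden", "494",
    "--epochs", "120",
    "--export_dir", "exported_model",
], check=True)
```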

Got it. So it seems somewhat realistic to create a service that allows folks to upload a list of words, trains those words into a usable, very limited model within 1-2 minutes, and returns the model to the user?

It would be, yes … :slight_smile:

Well - as this might be of common interest - is anybody here willing to work with me on this one? Any chance to borrow some space on one of Mozilla’s servers?

As of now, we don’t have the training power to achieve that, especially considering the testing we need to perform soon. If you really target a small model, it would be fast. @elpimous_robot trained on the GPU of his Jetson board, for example :slight_smile:

I wasn’t worried too much about the speed of the process. It could even be queued, with the user receiving an email once the training has been completed. I know that lots of folks would love to use Deep Speech on small(er) systems, and that would require a tiny model in order to get fast and reliable recognition, e.g. on a Raspberry Pi. Why worry about thousands of words (and the possible misunderstandings) if all I need is … well … ten words? So - it would be cool if we had a platform doing the training for us. Now - I am willing to code this, but I don’t have the resources to make a platform like this available to a broader audience …

If you target a small dataset, then it might be cheap to play on EC2. I’m not a big fan of Amazon, but they have Tesla V100s in the P3 instances, like p3.2xlarge: I found it ranging from $3.06 to $3.30 per hour depending on the region. And you can work on any CPU for hacking / dev.

In the current state, even queueing would not be a good solution: there is a lot of work to do before we can expose that kind of feature, and hardware availability would make queueing useless for now.

Lissyx - don’t get me wrong. I sincerely appreciate the work you all are doing. Now I may be completely out of line, but I am afraid that the “deep speech” project is heading in the wrong direction. As I am an initial sponsor of the Mozilla project, I take the liberty to suggest an alternative route.

I am sure it’s cool to be able to say anything and have it transcribed into text. But IMHO that’s not really what the community needs. Most of us want an engine that can be used on our small or medium-sized computers to do some basic, reliable voice assistance without the need to hook up to anybody else. I understand the use cases for full recognition of natural language, and nobody says that the research shouldn’t continue. However - the project goals should be incremental, producing modular, usable and viable technology along the way that developers can implement now. Because that is how a project gains traction, interest and funding. As long as folks can’t apt-get install deep speech and use it on their Raspberry Pi, they will use other resources, and Deep Speech, while technologically interesting, remains just another project like Kaldi - interesting, nice … but not really feasible. Again - that’s just my opinion.

@mischmerz.
Hi.
I think you’re being hard on the DeepSpeech team.
The work is hard, and a small but specific board can open up a lot of possibilities.

DeepSpeech was one of my dreams, and it is coming true.

Sure, it could be better!! Be patient…
The future is coming soon…

Wish you the best.
Vincent

Vincent - it is not that Deep Speech is bad. Far from it. I find it truly amazing. But I think concentrating on a fully trained model is overkill for the time being. It simply takes too much work and too many resources to get it to work nicely on consumer-grade hardware. I understand that the team is working hard to get results. However - a model with just 20 or so words would most likely allow that with the technology available today. All it takes is an interface that allows users to define “their” words and train them into a model with a far smaller footprint. But as far as I understand it, that is unfortunately not the priority now.

Well, your use case is just one of the scenarios, but certainly not the only one needed by the community.

I’d say that if you have the training data for those few words, retraining on pretty much any machine would take just a few minutes, as described above.

I think cheap things like having a command-line utility or a Docker image for such basic DeepSpeech training would make it more ergonomic, and surely would help in your case too.
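Just to sketch the shape of such a utility (all names, flags and paths below are hypothetical assumptions, not an existing tool), it could chain the language-model and training steps described earlier:

```python
# Hypothetical sketch of such a command-line utility: take a plain-text list
# of phrases and drive the language-model and training steps from the earlier
# sketches. Names, flags and paths below are assumptions for illustration.
import argparse
import subprocess

def main():
    parser = argparse.ArgumentParser(
        description="Train a tiny, phrase-limited DeepSpeech model")
    parser.add_argument("phrase_file", help="text file with one phrase per line")
    parser.add_argument("--train-csv", default="subset.csv",
                        help="CSV of audio clips covering the phrases")
    args = parser.parse_args()

    # Build the restricted language model from the phrase list.
    subprocess.run(["lmplz", "--order", "2", "--discount_fallback",
                    "--text", args.phrase_file, "--arpa", "lm.arpa"], check=True)
    subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)

    # Kick off a small training run; flag names depend on the DeepSpeech release.
    subprocess.run(["python", "-u", "DeepSpeech.py",
                    "--train_files", args.train_csv,
                    "--n_hidden", "494",
                    "--export_dir", "exported_model"], check=True)

if __name__ == "__main__":
    main()
```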