Creating An Online Environment To Train Specific Words

Hello. I learned a lot from my previous thread here and from some lively discussions on IRC. The full-featured model currently wouldn’t make sense on a small consumer computer such as the Raspberry Pi, but a dramatically reduced model with 20 or 30 words would most likely be feasible. Given that speech-driven technology most often runs on very small hardware, I think we should find a way to do at least limited STT directly on those devices. The alternative is to use a cloud service for the STT, with all the disadvantages to one’s personal privacy and the fact that one might not be able to turn the lights on because the Internet is down.

While it may be possible to run a very limited model on a Raspberry Pi, it requires training the specified words into a “personal” model that can then be used on that hardware.
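To illustrate why this is attractive: once such a reduced model exists, using it on the device itself is comparatively lightweight. A minimal sketch, assuming the deepspeech Python package (whose exact API has changed between releases) and hypothetical model/scorer file names:

```python
import wave
import numpy as np
from deepspeech import Model

# Hypothetical files produced by training on a small, personal word list.
ds = Model("my_small_model.pbmm")
ds.enableExternalScorer("my_small_words.scorer")

# 16 kHz, 16-bit mono WAV, as the acoustic model expects.
with wave.open("command.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # e.g. "turn on bed room light"
```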

At this time, this requires downloading the Common Voice database, installing all the tools including TensorFlow, and at least some knowledge of how to get everything running. Frankly, that is too much for an average developer who wants to integrate DeepSpeech into his or her project.
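To give an idea of what “getting everything running” involves: the training scripts consume CSV manifests that map each WAV recording to its transcript. A rough sketch of preparing such a manifest for a small word list (the directory layout and file names are assumptions for illustration):

```python
import csv
import os

# Hypothetical layout: one folder of 16 kHz mono WAV recordings per word,
# e.g. recordings/red/0001.wav, recordings/green/0002.wav, ...
RECORDINGS_DIR = "recordings"

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Column names the DeepSpeech training scripts expect.
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for word in sorted(os.listdir(RECORDINGS_DIR)):
        word_dir = os.path.join(RECORDINGS_DIR, word)
        if not os.path.isdir(word_dir):
            continue
        for wav in sorted(os.listdir(word_dir)):
            if not wav.endswith(".wav"):
                continue
            path = os.path.join(word_dir, wav)
            writer.writerow([path, os.path.getsize(path), word])
```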

This is why I am calling for an online environment in which users upload their word list and can download the trained model. But this is not something we can do without funding, at least for the server space. I am sure that having individual limited models available to make DeepSpeech feasible even on small platforms would boost the overall project and truly generate technology that can be used by developers in the field.
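To make the proposal a bit more concrete, here is a rough sketch of what the front end of such a service could look like. Everything here (the endpoint names, the JSON shape, the missing training back end) is an assumption, not an existing API; a real service would hand the job to GPU workers behind the scenes:

```python
from flask import Flask, request, jsonify, send_file
import uuid

app = Flask(__name__)
jobs = {}  # job_id -> status and path of the exported model (in-memory, sketch only)

@app.route("/train", methods=["POST"])
def submit_word_list():
    # Expects JSON like {"words": ["red", "green", "orange"]}
    words = request.get_json().get("words", [])
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "words": words}
    # A real service would enqueue a training job on GPU hardware here.
    return jsonify({"job_id": job_id})

@app.route("/model/<job_id>", methods=["GET"])
def download_model(job_id):
    job = jobs.get(job_id)
    if job is None or job["status"] != "done":
        return jsonify({"status": job["status"] if job else "unknown"}), 404
    return send_file(job["model_path"])
```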

Mozilla has in the past created dedicated spaces for specific tasks. This is no different. DeepSpeech deserves to be made available on all platforms and for all purposes, including home-brew IoT services that currently need to rely on network streaming services.

So, thanks for the thread. This is something interesting, and we are looking into it, but so far we cannot make any promise or commitment, sadly :).

Yet, any contribution to the pipeline to help achieve this would be welcome!

@lissyx -

Thanks for the answer. Would you like to explain why you are currently not able to extend the DeepSpeech environment into something that would make it so much easier for folks out there to use it?

I understand that this project is currently more of a research thing. But it could be turned into something very useful even now with just a little effort and a few additional resources.

Michaela

Simple:

  • not (yet) enough hardware to provide that and still be able to do our daily work
  • no way (yet) to make all this hardware available externally (it’s on Mozilla’s internal network)
  • no tooling (yet) to expose that through a website
  • nobody has time to work on that kind of feature for the moment

What you call “a few additional resources” is not just “a few”. Again, really, please understand that we are in sync with what you would like, but we cannot do everything at once. And no, we are not just “a research thing”. We actually want to make something that developers can seriously build on top of, and rely on.

Now, I do hope you are not the only one wanting that kind of feature, and that you and some other contributors might be able to start working on some code that makes this kind of service possible. We cannot promise anything yet, but we will do everything we can to help with that. And sadly, we cannot provide any ETA, since it depends on too many things so far.

Voice is currently the driving force behind a new evolution of technical services. I am not sure if (and how deeply) you have looked at Alexa, Google Home and all the other services currently flooding the market. I suppose those folks have plenty of experience in how to create STT technology.

It may surprise you that those services generally do not work by interpreting a full corpus of naturally spoken language. As a matter of fact, they require programmers to submit a number of keywords (“utterances”) linked to “slots” that trigger a pre-defined action when recognized. They call that a “skill”, and I have written a number of those skills. Every skill, from pizza ordering to flight reservation, uses a pre-defined subset of words that needs to be “compiled” (for lack of a better word) before it can be used. I strongly suspect that they use those 20-30 seconds of “compile” time to train or otherwise create a subset of words to be recognized for that particular skill.
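Roughly, such an interaction model boils down to something like the following. The structure is a simplified, hypothetical sketch, not the exact format of any vendor, but it shows how small the vocabulary of a typical skill really is:

```python
# Hypothetical "skill" definition in the spirit of the Alexa/Google interaction
# models: a handful of intents, each with sample utterances and typed slots.
LIGHT_SKILL = {
    "intents": [
        {
            "name": "SetLightColor",
            "samples": [
                "turn the {Room} light {Color}",
                "make the {Room} light {Color}",
            ],
            "slots": [
                {"name": "Room", "values": ["bed room", "living room", "kitchen"]},
                {"name": "Color", "values": ["red", "green", "orange"]},
            ],
        },
        {
            "name": "SwitchLightOff",
            "samples": ["turn off the {Room} light"],
            "slots": [
                {"name": "Room", "values": ["bed room", "living room", "kitchen"]},
            ],
        },
    ],
}

# The full vocabulary the recognizer ever needs for this skill:
vocabulary = set()
for intent in LIGHT_SKILL["intents"]:
    for sample in intent["samples"]:
        vocabulary.update(w for w in sample.split() if not w.startswith("{"))
    for slot in intent["slots"]:
        for value in slot["values"]:
            vocabulary.update(value.split())
print(sorted(vocabulary))  # a couple of dozen words, not a full natural-language corpus
```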

There is certainly a reason for Amazon, Google et al. to do it that way. It is not only much faster, it is also much more reliable. Why? Because if you configure the utterance “bed room”, the engine will not come back with “brads room” or “pet room”. If you train the engine to recognize only “red”, “green” and “orange” and then ask it to transcribe “purple”, it can stop processing after the first few iterations because there is no viable way forward.
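A toy sketch of that early-stopping idea (this is not how any real decoder is implemented, just an illustration of the pruning argument): a hypothesis is only kept while it is still a prefix of some allowed word.

```python
VOCAB = ["red", "green", "orange"]

def viable_prefix(hypothesis: str) -> bool:
    # Is the partial result still a prefix of any allowed word?
    return any(word.startswith(hypothesis) for word in VOCAB)

def decode(char_stream) -> str:
    hypothesis = ""
    for char in char_stream:              # stand-in for per-frame character outputs
        if not viable_prefix(hypothesis + char):
            return ""                      # no vocabulary word can match -> give up early
        hypothesis += char
    return hypothesis if hypothesis in VOCAB else ""

print(decode("green"))    # -> "green"
print(decode("purple"))   # -> "" (rejected after the very first character)
```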

So, when I ask for an online environment to create greatly reduced models, I am only suggesting to branch out a path that reproduces the approach the market leaders in voice recognition have chosen within their own environments. And I am quite sure that I am not the only one who wants that kind of feature.