Please tell me if I am wrong, but I get the impression the longer-term DeepSpeech roadmap emphasizes ‘on-device’/‘in-browser’ speech recognition as opposed to hosted STT behind a REST or WebSocket API endpoint.
I guess it makes sense. For one thing, keeping everything on the device/in the browser has privacy (and latency) advantages. And secondly, as an open-source project, it’s not really a good fit to be thinking about hosting questions, because who is going to pay the hosting bill, etc.?
That said, there are a lot of end-users who do want to solve the ‘hosting’ problem (like us, for example), especially with streaming and in a scalable way. I know there are plenty of examples of how this is done, and we have instances of streaming and non-streaming endpoints up and running (at various levels of maturity, and not necessarily up to date with the latest DS version).
I guess I was just wondering if it’s in the longer-term roadmap to treat hosting as a first-class concern within the library itself: a sort of opinionated, ‘best-practice’ way to do hosting, as a subfolder or a separate repo? In particular, hosting in a way that is efficient in terms of resource usage (e.g. via lambdas/serverless style) so that it is as affordable as possible to keep the service available (given that GPU hosting is not cheap).
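For concreteness, here is the kind of thin, non-streaming endpoint I have in mind. This is only a minimal sketch, assuming the deepspeech Node bindings plus Express; the model/scorer filenames, route, and port are placeholders, not anything the project ships today:

```js
// Minimal, non-streaming STT endpoint (sketch). Assumes:
//   npm install deepspeech express
// and a released model + scorer on disk (paths below are placeholders).
const DeepSpeech = require('deepspeech');
const express = require('express');

const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
model.enableExternalScorer('deepspeech-0.9.3-models.scorer');

const app = express();
// Expect raw 16 kHz, 16-bit mono PCM in the request body.
app.use(express.raw({ type: 'application/octet-stream', limit: '10mb' }));

app.post('/stt', (req, res) => {
  // model.stt() runs synchronous inference on the PCM buffer.
  res.json({ text: model.stt(req.body) });
});

app.listen(8080);
```

A streaming variant would presumably wrap model.createStream() / feedAudioContent() behind a WebSocket instead, which is where the ‘opinionated best practice’ guidance would really help.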
If it’s on the roadmap but just a question of resourcing, I wonder if we could perhaps contribute code towards this and/or team up with folks who are also working on this aspect?
(I don’t work for Mozilla, so I can’t speak for their roadmap.)
But…
> That said, there are a lot of end-users who do want to solve the ‘hosting’ problem
I have been working on a way to make hosting your own DeepSpeech server completely unnecessary: run DeepSpeech locally in an ElectronJS desktop application, and inject the speech recognition results into the browser through a web browser extension. This will allow developers to build speech-recognition-enabled web applications using only client-side JavaScript, and the same solution works in Firefox, Chrome, and Brave.
So instead of web developers hosting X instances of DeepSpeech on big, high-memory cloud servers, users are required to “bring their own speech recognition system”.
The results are pretty great, and I really think this is the way to go. It doesn’t make sense to me that every web application needs to have its own speech recognition system. That’s not really scalable; it’s too difficult to build and host. But installing one speech recognition system on your own computer and re-using it across many websites will scale, and you can add voice controls to any existing web page with just a bit of JavaScript code – no audio data is sent over the wire, only the text results from DeepSpeech.
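To make “just a bit of JavaScript” concrete, something like the page-side sketch below is the idea. The event name and payload here are placeholders I’m using for illustration; the final extension API may look different:

```js
// Hypothetical page-side listener. Assumes the browser extension dispatches
// a 'deepspeech-result' CustomEvent with the transcript in event.detail.text;
// the real extension API may differ.
window.addEventListener('deepspeech-result', (event) => {
  const transcript = event.detail.text; // plain text from the local DeepSpeech instance

  // Example voice command: scroll the page.
  if (transcript.includes('scroll down')) {
    window.scrollBy({ top: 400, behavior: 'smooth' });
    return;
  }

  // Otherwise, type the transcript into a search box (if the page has one).
  const box = document.querySelector('input[type="search"]');
  if (box) box.value = transcript;
});
```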
Plus, you get all the benefits of having the speech recognition system running locally – it works offline, keeps your audio private, lets you control your computer, type things on your keyboard… and program your own commands.
If you’d like to be among the early beta testers, I’ll post on this Discourse when the app & browser extension are ready. I have very little other documentation right now, but I do all the development on GitHub in these repos: