Mozilla Voice STT in the Wild!

With Mozilla Voice STT being open source, anyone can use it for any purpose. This is a good thing!

However, in order to maintain support for Mozilla Voice STT within Mozilla, we’re often asked…

Who’s using Mozilla Voice STT and for what purposes?

With that in mind I thought it’d be a good question to put to this audience, the users of Mozilla Voice STT.

Who are you and for what purposes are you using Mozilla Voice STT?


We are a German NLP startup, and we are currently developing a model based on Common Voice data plus 600 hours of TV shows for a German broadcasting company.


Hello there! We are a startup that provides a cloud-based phone system to companies, and we are actively looking at DeepSpeech to build new features (searchable recordings, call labelling, call scoring, to name a few) without using mainstream, commercial APIs, while being able to tune our own models for our specific use cases (e.g., noisy phone calls and voicemails).


We are a French consortium named Esup-portail, which develops and maintains applications used in most universities in France. One of these applications, called Esup-pod, is a web platform to store and watch videos. We use DeepSpeech to transcribe the French and English audio of the videos and automatically create subtitles. Don’t hesitate to contact me if you want more details!


My company makes products targeted at the entertainment industry. Our cloud workflow platform is used by customers to share work-in-progress content with their colleagues and clients and get feedback.

Currently we are running DeepSpeech as a preview/beta using the public models. We have tried to be open with customers about the situations it works well with and set expectations correctly. In the meantime, we are working on training a model that better matches the kinds of files our customers upload. (Because it is a preview it is not currently listed as a feature on the homepage but it’s there.)

Our customers’ use-cases tend to fall into two categories: transcribing finished content (caption files / text transcripts are often required when submitting content to distributors) and transcribing raw camera footage (dailies) to make the content easily searchable. We are planning to build additional features on top of transcription like version diffs.

There were two main barriers to using Google/AWS for transcription:

  1. Per-minute billing. Many customers upload hundreds of hours of content per month and we would have to implement usage limitations to stop costs spiraling out of control. We think transcription of video content is useful enough that it should be easily accessible and we therefore don’t plan to charge for transcription, unlike all of our competitors.

  2. It’s not self-contained. We offer customers the ability to self-host our service if they prefer, so we needed a solution that does not require a connection to an outside service. Also, as much of the content we host is pre-release movies/TV shows, privacy is important and the fewer third-parties we share with, the better.

DeepSpeech obviously solves both of these problems, as well as letting self-hosted customers create their own models to better match their content if they wish to do so. As a small company, the connection with Common Voice was extremely important to us as well.

If you have any questions or want me to run experiments, I am happy to do so, although I cannot of course provide or reveal customer information.


At Bangor University we’re using DeepSpeech (and Welsh CommonVoice) within a Welsh language digital assistant project. Based on Flutter, our app for Android and iOS can respond to simple questions regarding weather, news, time, Welsh language wikipedia and Welsh language music on Spotify, thanks to a hosted DeepSpeech server.

We’re also evaluating DeepSpeech’s recent work on transfer learning for larger domains such as dictation and captioning. Results so far have been very exciting for a lesser-resourced language like Welsh. It would be awesome to see transfer learning supported in main releases of DeepSpeech.

Thank you so much Mozilla for DeepSpeech!!


My name is Dan, and I’m working on a voice + motion control system called Jaxcore. I’m using DeepSpeech to add speech recognition to control computers / home theaters / and smart home devices.

The upcoming open source desktop app will have DeepSpeech built in, with voice commands for things like controlling your computer mouse, typing on your computer, and controlling media players.

You’ll be able to write web games that have speech recognition (using only client-side JavaScript).

The speech recognition libraries are all modular and can be used individually.


I use DeepSpeech as a local STT engine for Mycroft. It’s called via the deepspeech-server tool, so multiple devices can access it. It runs on a desktop CPU quite nicely. The audio is saved from Mycroft, which I can use for fine-tuning down the road sometime.

There’s a fine-tuned model and I use some filtering to help. Accuracy is good, latency is good.
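As a sketch of what a local setup like the one above involves: DeepSpeech’s `stt()` call expects 16 kHz, 16-bit mono PCM samples, so a recording usually needs a small loading step first. The helper below uses only the Python standard library; the model file name and the `deepspeech` inference snippet in the trailing comment are illustrative, not taken from this post.

```python
import array
import wave

def load_wav_samples(path):
    """Read a WAV file and return its 16-bit PCM samples plus the sample rate."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("DeepSpeech expects 16-bit mono PCM audio")
        samples = array.array("h")  # signed 16-bit integers
        samples.frombytes(wav.readframes(wav.getnframes()))
        return samples, wav.getframerate()

# With the deepspeech package and a trained model installed, inference is a
# single call (file names here are illustrative):
#
#   import numpy as np
#   from deepspeech import Model
#   model = Model("deepspeech-0.9.3-models.pbmm")
#   samples, rate = load_wav_samples("utterance.wav")  # rate should be 16000
#   print(model.stt(np.array(samples, dtype=np.int16)))
```

A server such as deepspeech-server wraps exactly this kind of model call behind HTTP so several devices can share one machine’s model.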


This one?


That’s the one. Used that for a while now.


Our company Iara Health provides a system to aid radiologists in writing medical reports in Brazilian Portuguese. Our entire system is built on DeepSpeech, running locally on the user’s computer.

In the video above, you can see our portal recognizing commands (like loading a template) and handling punctuation, acronyms, and abbreviations. Our system eases the work of radiologists, helping them produce more in less time.

We want to thank Mozilla for DeepSpeech.


We’re using DeepSpeech in Tarteel to recognize Quran recitation and correct people’s mistakes on a word-by-word level!
The Quran is the Muslims’ holy book, and we are instructed to recite it with “Tarteel”, which translates best to “slow measured rhythmic tones”.
Muslims who try to memorize the Quran sometimes struggle to find an instructor to correct them. The Tarteel platform provides a “Quran Companion” they can recite to when they don’t have anyone to correct them.


I’m French and self-taught…

I’m a roboticist (purely as a hobby), and I work on social interactions between robots and humans.
Voice interaction is essential…
Thanks to DeepSpeech.

I made a tutorial to help others create their own model. :wink:


I’m a computer science student intrigued by data science, and I want to deploy a Spanish Speech-To-Text model that is easy to integrate, easy to use, and flexible. I’ve actually found that in some cases my DeepSpeech Spanish model can outperform the Google and IBM Watson Speech-To-Text models in real situations, with just 450 hours for train, dev, and test.


We are Vivoce, a startup using DeepSpeech to detect pronunciation errors and help users improve their accents for language learning.


I hope to use DeepSpeech for African languages in the near future.


What languages are you interested in?

We use Mozilla DeepSpeech for voicemail transcription in FusionPBX via our DeepSpeech Frontend and some code we upstreamed into FusionPBX to add support for custom STT providers. Our users find transcriptions quite useful, with Mozilla DeepSpeech serving them with transcriptions since August 2018!
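For context on how a voicemail pipeline like this typically works: the PBX hands the recorded WAV to an STT frontend over HTTP and gets a transcript back as JSON. The sketch below builds such a request with the standard library only; the endpoint path, header, and response shape are assumptions for illustration, not the actual DeepSpeech Frontend API.

```python
import json
import urllib.request

def build_stt_request(endpoint, wav_bytes):
    """Build (but do not send) an HTTP request posting a voicemail
    recording to an STT frontend. Endpoint and header are illustrative."""
    return urllib.request.Request(
        endpoint,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

def parse_transcript(response_body):
    """Assume the frontend answers with JSON like {"text": "..."}."""
    return json.loads(response_body)["text"]

# Actually sending it would be:
#   with urllib.request.urlopen(build_stt_request(url, wav_bytes)) as resp:
#       transcript = parse_transcript(resp.read())
```

The transcript string can then be stored alongside the voicemail record so users can read messages without listening to them.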

I would love to collaborate with @Gabriel_Guedj and others to build tuned models and deeper integrations in the telephony space. Feel free to reach out; I am reachable on Matrix.

Mozilla DeepSpeech is awesome, I really appreciate all the hard work @kdavis, @reuben, @lissyx, and other contributors have put in over the years to build this!


A student at Dalarna University, Sweden, trying to use DeepSpeech to train a model for the Somali language.