Mozilla Voice STT in the Wild!

We’re using DeepSpeech in tarteel.io to recognize Quran recitation and correct people’s mistakes on a word-by-word level!
The Quran is the Muslims’ holy book, and we are instructed to recite it with “Tarteel”, which translates best as “slow, measured rhythmic tones”.
Muslims who try to memorize the Quran sometimes struggle to find an instructor or someone to correct them. The Tarteel platform provides a “Quran Companion” they can recite to when they don’t have anyone to correct them.

8 Likes

Hello.
I’m French and self-taught…

I’m a roboticist (as a hobby only), and I work on social interactions between robots and humans.
Voice interaction is essential…
Thanks to DeepSpeech.

I made a tutorial to help everyone create their own model. :wink:

4 Likes

I’m a computer science student intrigued by and interested in data science, and I want to deploy a Spanish Speech-To-Text model that is easy to integrate, easy to use, and flexible. I have found that in some cases my DeepSpeech Spanish model can outperform the Google and IBM Watson Speech-To-Text models in real situations, with just 450 hours of data for train, dev, and test.

4 Likes

We are Vivoce, a startup using DeepSpeech to detect pronunciation errors and help users improve their accents for language learning.

5 Likes

I hope to use DeepSpeech for African languages in the near future.

2 Likes

What languages are you interested in?

We use Mozilla DeepSpeech for voicemail transcription in FusionPBX via our DeepSpeech Frontend and some code we upstreamed into FusionPBX to add support for custom STT providers. Our users find transcriptions quite useful, with Mozilla DeepSpeech serving them with transcriptions since August 2018!

I would love to collaborate with @Gabriel_Guedj and others to build tuned models and deeper integrations in the telephony space. Feel free to reach out, I am @dan:whomst.online in #machinelearning:mozilla.org on Matrix.

Mozilla DeepSpeech is awesome, I really appreciate all the hard work @kdavis, @reuben, @lissyx, and other contributors have put in over the years to build this!

5 Likes

I’m a student at Dalarna University, Sweden, trying to use DeepSpeech to train a model for the Somali language.

2 Likes

Hi victornoriega7,
Could you share your DeepSpeech Spanish model? I am quite far from having that many hours of transcribed Spanish data to train my own model.
Thanks
ana

I am building an online video consultation tool for primary care (e.g. GPs, family medicine) and using DeepSpeech to enable transcription of the meeting. I am currently working on integrating DeepSpeech with Jitsi (an open-source videoconferencing tool).

Once the integration works, the transcriptions generated by DeepSpeech could potentially be fed to ML algorithms to generate suggestions for the doctor.

1 Like

Maybe you could talk to @bernardohenz? He might have advice on adapting DeepSpeech to the medical domain.

I am building a small personal companion that has the following features:

  1. It does not require Wifi or internet connections.
  2. It will be powered with less than 5 volts.
  3. It will auto recharge with light.
  4. If you talk to it, it will talk back to you intelligently with a voice.
  5. The companion can talk to you an hour a day for 15 years, with no overlap.
  6. The companion will have an extensive memory and will learn from you.
  7. It will be no larger than an apple or a large deck of cards.
  8. The per-unit cost will be less than $40 USD.

I am doing this because I can do it and I want to create something interesting.

3 Likes

I am trying to build a model that understands German dialects.

3 Likes

Tēnā koutou kātoa!

Te Hiku Media is a Māori organization based in Aotearoa (New Zealand). Our purpose is to preserve and promote te reo Māori, the indigenous language of Aotearoa.

We’ve been using DeepSpeech since May 2018. We found it worked pretty well for te reo Māori, which is an oral language that was phonetically transcribed in the 19th century. We have an API running with a word error rate (WER) of about 10%, and we use this API to help us speed up the transcription of native speaker (L1) recordings.
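For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration, not the evaluation code used for the API described above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, one wrong word in a ten-word reference gives a WER of 10%. Note that WER can exceed 100% when the hypothesis contains many inserted words.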

Deployment
We’re running DeepSpeech in a Docker container on a p2.xlarge DeepLearning Ubuntu AMI in AWS. This is behind a Load Balancer and an Auto Scaling Group, which allows us to bid for those old p2 instances at a relatively affordable cost (we spend about USD$1000/month to keep the API available 24/7). We use FastAPI to load and run DS in Python. We’ve got a Django instance, koreromaori.io, between this API and what the end-user sees (there are reasons for this), but we’re in the process of figuring out how to deploy DeepSpeech more efficiently. Keen to hear what others are doing.

Use
For many reasons, some of which you can learn about in this Greater Than Code podcast and this Te Pūtahi podcast, we’ve built our own Django-based web app to collect te reo Māori data, koreromaori.com [corpora]. We started this around the same time as the Common Voice project, and because my experience was in Django, it made more sense for us to work on corpora. Of course, in hindsight there are many more reasons why using your own platform for data collection can be useful. For example, all the data is available through an API, which helps us when it comes time to train models. We also label data specifically for our context, such as whether a speaker is “native” (L1 vs. L2) or whether pronunciation or intonation is correct. Finally, for indigenous languages, it’s often more appropriate for the data to remain with the community rather than being put in the public domain.

Since we were able to train a DS model early on, we use this model to help us “machine review” data. We also built our own transcription tool, Kaituhi, to help us with transcribing our audio archives. It’s kind of like the BBC React Transcript Editor, which I found out about AFTER we started work on Kaituhi. We use koreromaori.io to provide automated transcriptions for Kaituhi, and we’re hoping to add word-level confidences to the transcriptions to speed up the review process (the confidences are in the DS API, they’re just not exposed yet in our API).
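To illustrate how word-level confidences could speed up review, here is a minimal sketch that flags only the low-confidence words for a human to check. The input shape (a list of `(word, confidence)` pairs), the function name, and the `0.8` threshold are all assumptions for illustration; they are not the format returned by the DS API or by the koreromaori.io API.

```python
from typing import List, Tuple


def words_needing_review(
    words: List[Tuple[str, float]], threshold: float = 0.8
) -> List[Tuple[int, str, float]]:
    """Return (position, word, confidence) for every word below the threshold.

    `words` is a hypothetical list of (word, confidence) pairs with confidence
    in [0, 1]; a reviewer could jump straight to the flagged positions instead
    of re-listening to the whole recording.
    """
    return [
        (position, word, confidence)
        for position, (word, confidence) in enumerate(words)
        if confidence < threshold
    ]
```

The threshold would need tuning against real review data: set it too high and reviewers check almost everything, too low and errors slip through.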

Kaimahi & Community
Some of our team are also on this Discourse: @utunga and @mathematiguy. You may see them ask questions from time to time, so please chime in!

The main reason our initiative to build STT for a language that nearly went extinct was successful is the community around the language. We cannot forget the hard work done by so many during the 20th century to make te reo Māori (and other indigenous languages) a living language once again. I think Mozilla’s done a good job of building a community around Common Voice. If you’re someone working on language tools for non-mainstream languages, building trust with the right community is critical to solving the data problem. It is also important to understand that a level of respect and responsibility comes with access to data.

Ngā Mihi
We wouldn’t be where we are today in terms of the technology if it wasn’t for DeepSpeech. So a big thank you to Mozilla and the DeepSpeech team :clap:t4: and all of you who are an active part of the DS and Common Voice community!

14 Likes

Having posted earlier in this thread, I just wanted to give an update on the scripts we’ve developed at Bangor University. They bring together the various features of DeepSpeech, along with CommonVoice data, and provide a complete solution for producing models and scorers for Welsh language speech recognition. They may be of interest to other DeepSpeech users who are working with a similarly lesser-resourced language.

The scripts:

  • are based on DeepSpeech 0.7.4
  • make use of DeepSpeech’s Dockerfiles (so setup and installation are easier)
  • train with CommonVoice data
  • utilize transfer learning
  • with some additional test sets and corpora, produce optimized scorers/language models for various applications
  • export models with metadata

The initial README describes how to get started.

We’d also like to share the models produced by these scripts, which can be found at https://github.com/techiaith/docker-deepspeech-cy/releases/tag/20.06

At the moment these models are used in two prototype applications which the Welsh speaking community can install and try, namely a Windows/C# based transcriber and an Android/iOS voice assistant app called Macsen. Source code for these applications using DeepSpeech can also be found on GitHub.

We are immensely grateful to Mozilla for creating the Common Voice and DeepSpeech projects.

11 Likes

Hello, we are a Polish company that produces software for logistics. Our R&D team is working on a “hands-free” system/device for warehouse workers. We have an Android app that is displayed on AR glasses, and DeepSpeech will be responsible for ASR (not full speech, only a couple hundred commands) to control logistics processes by voice (dictation of barcodes and numbers, and commands such as confirm, cancel, print, go back, go up, go down, etc.). For now we are mainly focusing on Polish speech recognition.
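With a closed command set like this, one common post-processing step is to snap whatever the recognizer outputs to the nearest known command and reject anything that is too far off. Below is a minimal sketch using Python's standard `difflib`; the command list is taken from the examples above, while the function name and the `0.6` cutoff are assumptions for illustration and not the company's implementation.

```python
import difflib
from typing import Optional, Sequence

# Example commands from the post; a real system would hold a couple hundred.
COMMANDS = ["confirm", "cancel", "print", "go back", "go up", "go down"]


def match_command(
    transcript: str, commands: Sequence[str] = COMMANDS, cutoff: float = 0.6
) -> Optional[str]:
    """Return the command most similar to the transcript, or None if no
    command reaches the similarity cutoff."""
    normalized = " ".join(transcript.lower().split())
    matches = difflib.get_close_matches(normalized, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Returning `None` for poor matches lets the app ask the worker to repeat the command instead of executing the wrong one, which matters for actions like confirm and cancel.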

I wanted to say that DeepSpeech is awesome, and to the DS team: your work allows many researchers and companies around the world to make great new tech. Thank you for that!! The DS community is also very helpful :heart: I am very grateful for your work and involvement. Thank you once again.

4 Likes

Hi, I built a voice assistant for multiple languages called Jaco-Assistant.
It can run completely offline and the code is open source.

There’s also a skill store where you can share and download skills. Jaco automatically generates domain specific language models out of the installed skills to achieve high detection accuracy.

In a benchmark, Jaco performed better at understanding speech commands than solutions from Amazon and Google (but it has some problems with loud background noise).

The assistant uses a highly modular, container-based architecture, which also makes it easy to reuse parts of it for other projects. The skills run in containers too, which has the great benefit that developers are not restricted if they have special requirements for their own skills.

Currently Jaco understands English and German and runs on Linux computers. It can be extended to further languages, and I hope to get it running on a Raspberry Pi soon, too.

You can find the project here:

At this time it’s working and usable, but far from perfect, and I’d be happy for anyone to contribute to the project by translating it into other languages, solving issues, or creating cool new skills :)

Update: Jaco can now be built and run on a Raspberry Pi 4. Also updated the benchmark graph.

Update: Added support for French and Spanish. Improved noise resistance and accuracy using new DS-0.9 model. Updated benchmark graph.

5 Likes

Definitely in the wild, this one. I am building a sexbot companion as a personal project. An in-head Raspberry Pi captures 4 channels of audio, does some basic noise reduction, classification, and segmentation, and sends the signal over Wi-Fi to be processed by DeepSpeech on a desktop GPU, which returns the result. I have that much working. I am currently collecting more data to train the models. It’s mostly gibberish now, but I expect that in a few more months, and with a lot more GBs of WAV files, the words will flow. Thank you for making DeepSpeech.

We are a Persian startup working in the STT field. We are trying to achieve results with Mozilla DeepSpeech.

So far we are just training to find good checkpoints.

1 Like

Hello! Back in 2018 I was just a .NET dev deploying cloud-based ASR. As a side project to my main job, I wanted to perform offline Spanish speech recognition for my personal “chatbot”, and that’s how I discovered DS on GitHub. Long story short, I ended up replacing cloud ASR with DS for almost all of my apps :blush:

Projects I’m working on with this new superpower thanks to STT Team and Mozilla:

A .NET app for dictating names, IDs, long numbers, dates, details, notes, etc. from images in PDFs for form filling (offline OCR fails):
This app also works with a website similar to Common Voice and with private storage that collects all the audio and text from the form inputs, which makes it possible to periodically fine-tune on corrections and thus outperform any cloud ASR. Also, optional transcriptions were key to success: there are often names that end in different characters but sound the same. @juan_pablo_Garzon_duenas is also working with me on this, and he actually requested the app, thanks! (We met on the DS GitHub) :laughing:
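As a rough illustration of the "collect corrections, then fine-tune" loop, the sketch below turns corrected form inputs into a CSV with the `wav_filename`, `wav_filesize`, and `transcript` columns that DeepSpeech training expects. It is written in Python rather than .NET for brevity. The function name, the `(wav_path, corrected_text)` input shape, and the lowercasing/whitespace normalization are assumptions for illustration (the right normalization depends on the model's alphabet); this is not the app's actual code.

```python
import csv
import io
import os
from typing import Callable, Iterable, Tuple


def build_finetune_csv(
    samples: Iterable[Tuple[str, str]],
    getsize: Callable[[str], int] = os.path.getsize,
) -> str:
    """Build CSV text from (wav_path, corrected_text) pairs.

    `getsize` returns the audio file size in bytes; it is a parameter so the
    function can be exercised without real files on disk.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for wav_path, corrected_text in samples:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        transcript = " ".join(corrected_text.lower().split())
        if not transcript:
            continue  # skip samples whose corrected text is empty
        writer.writerow([wav_path, getsize(wav_path), transcript])
    return buffer.getvalue()
```

Writing the returned text to a file gives a training set that can be regenerated each time new corrections arrive.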

And a couple of other projects I can’t share about, but they are all similar.

An app used by a microphone firm to perform general dictation. We are achieving really good results (can’t share too much detail on this one), but I can share a video of it vs. YT transcriptions in the chat. We are also using Mozilla RNNoise as VAD, thanks again Mozilla!

All of my solutions are based on the open-source WPF example :blush:. Fun fact: they found me through Reuben’s post on Mozilla Hacks, thanks again Mozilla.

Key features of DS vs others on this type of projects:

  1. Privacy
  2. Accuracy
  3. Continuous fine-tuning
  4. Puts value on the time and data collected (you can’t do this with cloud ASR; for most of them it is illegal to store transcriptions). This is key: they don’t want to only make Google better by giving away all their data while being unable to keep valuable results from their own data.
  5. Not limited to a single programming language, so it can work in almost any existing environment. Open-source ASR usually supports only Python or C++ (see NVIDIA NeMo or Kaldi), which makes it hard to keep a team together if they don’t know both; with DS it is easy for the team to adapt using their favorite language!
  6. Data for Spanish is really limited; there is no dataset like LibriSpeech for Spanish, which makes any small dataset you build very valuable.

Now the sad part:
They love Mozilla STT and choose it for its privacy; they don’t want to share any data, which makes the engine’s best feature the worst thing for growing Mozilla’s datasets :frowning:

I’m also looking forward to context-aware hot words and am keeping an eye on @josh_meyer’s code. This is requested frequently, to be able to dictate corporation names taken from contracts on the fly.

Thanks to everybody, I love to see how my .NET example is used to build amazing things!

Thanks, thanks thanks thanks!!! And finally: thanks :stuck_out_tongue:

Extra thanks: I know this is for STT, but I definitely want to thank @erogol. I’m not currently using TTS, but there are a lot of requests for on-the-fly TTS to provide information inside elevators (mostly for hotels in this COVID era), and also for kids learning English, adapting the listening exercises to their weaknesses. Hopefully, I will grow using Mozilla TTS.

Thanks,

3 Likes