Mozilla Voice STT in the Wild!

Tēnā koutou katoa! (Greetings to you all!)

Te Hiku Media is a Māori organization based in Aotearoa (New Zealand). Our purpose is to preserve and promote te reo Māori, the indigenous language of Aotearoa.

We’ve been using DeepSpeech since May 2018 and have found it works well for te reo Māori, an oral language that was first transcribed phonetically in the 19th century. We have an API running with a word error rate (WER) of about 10%, and we use it to help speed up the transcription of native speaker (L1) recordings.
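For anyone unfamiliar with the metric, a WER of 10% means roughly one word in ten needs correcting. Below is a minimal sketch of the standard WER calculation (edit distance over words); this is the textbook definition, not Te Hiku Media’s actual evaluation code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of three reference words gives a WER of about 0.33.
print(wer("tēnā koutou katoa", "tēnā koutou"))
```

A WER around 0.10 on L1 speech is what makes machine-assisted transcription practical: reviewers correct occasional words rather than typing from scratch.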

Deployment
We’re running DeepSpeech in a Docker container on a p2.xlarge Deep Learning Ubuntu AMI in AWS. It sits behind a Load Balancer and an Auto Scaling Group, which lets us bid for those older p2 instances at a relatively affordable cost (we spend about USD $1,000/month to keep the API available 24/7). We use FastAPI to load and run DeepSpeech in Python. We’ve got a Django instance, koreromaori.io, between this API and what the end user sees (there are reasons for this), but we’re in the process of figuring out how to deploy DeepSpeech more efficiently. We’re keen to hear what others are doing.

Use
For many reasons, some of which you can learn about in this Greater Than Code podcast and this Te Pūtahi podcast, we’ve built our own Django-based web app, koreromaori.com [corpora], to collect te reo Māori data. We started it around the same time as the Common Voice project, and because my experience was in Django, it made more sense for us to work on corpora. In hindsight, there are many more reasons why running your own platform for data collection can be useful. For example, all the data is available through an API, which helps us when it comes time to train models. We also label data specific to our context, such as whether a speaker is native (L1 vs. L2) or whether pronunciation or intonation is correct. Finally, for indigenous languages, it’s often more appropriate for the data to remain with the community rather than being put in the public domain.

Since we were able to train a DeepSpeech model early on, we use it to help us “machine review” data. We also built our own transcription tool, Kaituhi, to help us transcribe our audio archives. It’s similar to the BBC React Transcript Editor, which I found out about AFTER we started work on Kaituhi. We use koreromaori.io to provide automated transcriptions for Kaituhi, and we’re hoping to add word-level confidences to the transcriptions to speed up the review process (the confidences are available in the DeepSpeech API; they’re just not exposed yet in ours).
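To illustrate the kind of post-processing involved: DeepSpeech’s metadata output is character-level, so tokens have to be grouped into words before word-level timings (and any confidences) can be attached. The sketch below is hypothetical and not Kaituhi’s actual code; it models each token as a simple `(text, start_time)` tuple standing in for the library’s token metadata objects.

```python
# Hypothetical sketch: group character-level tokens (as returned by a
# DeepSpeech-style metadata API) into (word, start_time) pairs.
def group_tokens_into_words(tokens):
    """Split a stream of (character, start_time) tokens into words on spaces."""
    words, current, start = [], [], None
    for text, start_time in tokens:
        if text == " ":
            if current:  # flush the word accumulated so far
                words.append(("".join(current), start))
                current, start = [], None
        else:
            if not current:  # first character of a new word: remember its time
                start = start_time
            current.append(text)
    if current:  # flush the final word
        words.append(("".join(current), start))
    return words

tokens = [("k", 0.0), ("i", 0.1), ("a", 0.2), (" ", 0.3),
          ("o", 0.4), ("r", 0.5), ("a", 0.6)]
print(group_tokens_into_words(tokens))  # [('kia', 0.0), ('ora', 0.4)]
```

With word boundaries recovered like this, a transcript editor can highlight each word at its timestamp, and low-confidence words can be flagged for human review once per-word scores are exposed.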

Kaimahi & Community
Some of our team are also on this Discourse: @utunga @mathematiguy. You may see them ask questions from time to time, so please chime in!

The main reason our initiative to build STT for a language that nearly went extinct succeeded is the community around the language. We cannot forget the hard work done by so many during the 20th century to make te reo Māori (and other indigenous languages) a living language once again. I think Mozilla has done a good job of building a community around Common Voice. If you’re working on language tools for non-mainstream languages, building trust with the right community is critical to solving the data problem. It’s also important to understand that a level of respect and responsibility comes with access to data.

Ngā Mihi
We wouldn’t be where we are today in terms of the technology if it weren’t for DeepSpeech. So a big thank you to Mozilla and the DeepSpeech team :clap:t4: and to all of you who are an active part of the DeepSpeech and Common Voice communities!
