Timeline for releasing the DeepSpeech models trained with the Common Voice data

Hi, I wanted to ask if you have a timeline or workflow for releasing the DeepSpeech models trained with the Common Voice data, especially now that multi-language voice recording has launched. What I am looking for is more along the lines of milestones, like “after receiving the first 200 hours of recordings we will release the first trained models within x weeks.”

I have looked at the DeepSpeech releases and, if I am not mistaken, v0.1.1 includes a model trained on the combined LibriSpeech, Fisher, TED-LIUM and Switchboard datasets. This is understandable, since English models can be considered the gold standard for ASR performance, and the training was done with corpora that are widely used by the speech community.

However, for minority languages such as Welsh or Catalan, it will be interesting, and perhaps even necessary, to have some initial DeepSpeech models trained with less than 1k hours of data so that the community can start testing and integration.

If this was already discussed, sorry for the duplication. Please let us know what the workflow/timeline for model training would be, especially from the perspective of the minority languages that depend on Common Voice for kickstarting their acoustic data. (Maybe partly related to this topic.)


First, thanks for such a thoughtful question.

As brevity is the soul of wit, we don’t have a timeline, but we will release models in languages other than English as Common Voice collection allows.

To frame this answer a bit more: different languages will collect data at different rates. This is a function of many things: number of speakers of the language, internet usage among speakers of the language, popularity of Common Voice among speakers of the language, cultural norms among speakers of the language…

Hence, setting one-size-fits-all hard limits, e.g. "after the first 1k hours are collected we will create a model", is not in any way fair to languages with fewer speakers, languages with low internet usage, or languages that, for any number of extenuating circumstances, cannot collect 1k hours in a reasonable amount of time.

The answer is of course to create models at different rates for different languages. For example, if we only controlled for population, we would create a new Chinese (China) model for every additional 2358 hours of data and a new Welsh model for every additional hour. The problem is how to decide upon these rates. Naively, one would control only for population. However, this would not account for any of the other factors mentioned above: internet usage among speakers of the language, popularity of Common Voice among speakers of the language, cultural norms among speakers of the language… How to control for these factors will, I think, only become clear as we collect data for languages other than English. Thus, the timeline will also only become clear as collection progresses.
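To make the population-only example concrete, here is a minimal sketch of how such per-language rates could be derived. It is only an illustration, not something we have implemented: the function name `retrain_intervals` is made up, and the speaker figures are relative placeholder units chosen solely to reproduce the 2358-to-1 ratio mentioned above, not real population data.

```python
def retrain_intervals(relative_speakers, base_hours=1.0):
    """Scale each language's retraining interval by its relative speaker count.

    The language with the fewest speakers gets a new model every
    `base_hours` of new data; every other language gets a proportionally
    longer interval. This controls for population only.
    """
    smallest = min(relative_speakers.values())
    return {
        lang: base_hours * count / smallest
        for lang, count in relative_speakers.items()
    }


# Placeholder relative units (NOT real population figures), chosen only
# to match the 2358:1 ratio used in the example above.
relative_speakers = {"Welsh": 1, "Chinese (China)": 2358}

intervals = retrain_intervals(relative_speakers)
for lang, hours in intervals.items():
    print(f"{lang}: new model every {hours:g} additional hours")
```

Any of the other factors (internet usage, popularity of Common Voice, cultural norms) would have to enter as additional weights, and how to choose those is exactly the open question.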

Another problem which arises is model performance in the face of little data. Say, for example, we had 100 hours of Welsh data and wanted to create a model. 100 hours is not sufficient to create a speech recognition engine matching the quality of, say, Google's. So there is a research aspect that must also be addressed: how to create high-quality models for low-resource languages.

To that end, we have applied to the NSF to fund research in this direction and will hopefully be able to bring more resources to bear on the problem towards the end of the year.

Interestingly enough, work on this same problem, model performance in the face of little data, can also be used to enhance models for assistive technologies. For example, creating models geared towards certain speech impediments, where there may exist only a small data set exemplifying the impediment that we can train on. We plan to try to address this problem too.

Yet another problem which arises is tuning model hyperparameters to a particular language. For example, Chinese (China) has about 50k to 100k characters while English has only 28: the normal 26 plus a space and an apostrophe. This type of difference requires tuning the model to the language, a human task bottlenecked by the small number of team members we have on the machine learning team at Mozilla: 4.5 to be exact, and not all of them are working on speech recognition.
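As a rough illustration of why the character count matters, the sketch below builds the 28-character English alphabet described above and compares the size of a character-level output layer for both cases. The helper `output_units` and the one extra "blank" unit are assumptions about a CTC-style character model, not a description of our exact code, and 50,000 is simply the lower end of the range quoted above.

```python
import string

# English alphabet as described above: 26 letters plus space and apostrophe.
english_alphabet = list(string.ascii_lowercase) + [" ", "'"]


def output_units(num_characters, extra_symbols=1):
    """Units in a character-level output layer.

    One unit per character, plus `extra_symbols` for special symbols.
    The default of 1 assumes a single CTC-style blank symbol.
    """
    return num_characters + extra_symbols


english_units = output_units(len(english_alphabet))
chinese_units = output_units(50_000)  # lower end of the 50k-100k estimate

print(len(english_alphabet))  # 28
print(english_units, chinese_units)
```

An output layer more than a thousand times wider changes memory use, training time and the amount of data needed per character, which is why the same hyperparameters cannot simply be reused across languages.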

So, as you can see, it’s complicated. However, we hope to deliver more clarity as we see how data collection proceeds in non-English languages.