First, thanks for such a thoughtful question.
As brevity is the soul of wit, we don’t have a timeline, but we will release models in languages other than English as Common Voice collection allows.
To frame this answer a bit more: different languages will collect data at different rates. This is a function of many things: the number of speakers of the language, internet usage among those speakers, the popularity of Common Voice among them, cultural norms…
Hence, setting a one-size-fits-all hard limit, e.g. "after the first 1k hours are collected we will create a model", is not in any way fair to languages with fewer speakers, languages with low internet usage, or languages that, for any number of extenuating circumstances, cannot collect 1k hours in a reasonable amount of time.
The answer, of course, is to create models at different rates for different languages. For example, if we controlled only for population, we might create a new Chinese (China) model for every 2358 additional hours of data and a new Welsh model for every 1 additional hour. The problem is how to decide upon these rates. Naively, one would control only for population. However, that would ignore the other factors mentioned above: internet usage among speakers of the language, popularity of Common Voice among them, cultural norms… How to control for these factors will, I think, only become clear as we collect data for languages other than English. Thus, the timeline will also only become clear as collection progresses.
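To make the population-only scheme concrete, here is a minimal sketch. The speaker counts are illustrative assumptions (chosen so the ratio reproduces the ~2358:1 Chinese-to-Welsh example above), not authoritative figures, and the function name is hypothetical:

```python
# Illustrative speaker counts only -- not authoritative figures.
# The Chinese count is chosen so the ratio matches the ~2358:1 example.
SPEAKERS = {
    "Chinese (China)": 1_179_000_000,
    "Welsh": 500_000,
}

def release_threshold_hours(language, base_language="Welsh", base_hours=1.0):
    """Hours of new data per model release, scaled by speaker population.

    This controls *only* for population -- the naive scheme described
    above, which ignores internet usage, Common Voice popularity, etc.
    """
    ratio = SPEAKERS[language] / SPEAKERS[base_language]
    return base_hours * ratio

print(release_threshold_hours("Chinese (China)"))  # 2358.0 hours per release
print(release_threshold_hours("Welsh"))            # 1.0 hour per release
```

The point of the sketch is mostly to show where the naivety lives: everything interesting would have to go into adjusting that single ratio.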
Another problem which arises is model performance in the face of little data. Say, for example, we had 100 hours of Welsh data and wanted to create a model. 100 hours is not sufficient to create a speech recognition engine matching the quality of, say, Google's. So there is a research aspect that must also be addressed: how to create high-quality models for low-resource languages.
To that end, we have applied to the NSF to fund research in this direction and hopefully will be able to bring more resources to bear on the problem towards the end of the year.
Interestingly enough, this same problem, model performance in the face of little data, can also be used to enhance models for assistive technologies. For example, creating models geared towards certain speech impediments, where there may exist only a small data set exemplifying the impediment that we can train on. We plan to try and address this problem too.
Yet another problem which arises is tuning model hyperparameters to a particular language. For example, Chinese (China) has about 50k to 100k characters while English has only 28: the normal 26 letters plus space and an apostrophe. This type of difference requires tuning the model to the language, a human task bottlenecked by the small number of team members we have on the machine learning team at Mozilla, 4.5 to be exact, and not all of them work on speech recognition.
So you see it’s complicated. However, we hope to deliver more clarity as we see how data collection proceeds in non-English languages.