10,000 hours of validated speech

I discovered the project early this year. Back then, if I remember correctly, it was a little under 300 hours of validated speech? And now we’re sitting at around 400?

Pretty much all the news releases about the Common Voice project state that the goal is to collect 10,000 hours of validated speech. I can’t find any statement by Mozilla on that, though. (Was this originally stated on the Common Voice website and later removed?)

If the stats on the Common Voice website are accurate, that isn’t going to happen anytime soon. At only a few hundred hours collected in half a year, it would take several decades to reach 10,000 hours of validated speech.
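For what it’s worth, here’s the back-of-envelope extrapolation behind that estimate. It’s just a rough sketch using the approximate numbers from this thread (not official figures), and it obviously assumes the collection rate stays flat:

```python
# Back-of-envelope extrapolation using the rough figures from this thread
# (all numbers are approximations from the posts above, not official stats).
current_hours = 400   # validated hours now (approximate)
hours_added = 100     # growth over the last half year (~300 -> ~400)
goal_hours = 10_000   # the 10K-hour target discussed here

yearly_rate = hours_added * 2  # ~200 validated hours per year
years_to_goal = (goal_hours - current_hours) / yearly_rate
print(f"At ~{yearly_rate} h/year: ~{years_to_goal:.0f} more years to reach {goal_hours} h")
# -> roughly 48 years at the current pace, i.e. several decades
```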

So does that mean the original expectations were too optimistic? Or are all those news outlets wrong and 10,000 hours was never the goal? Or are the numbers on the website simply wrong? But if it was the goal, and the numbers are correct, what does the fact that we aren’t anywhere close to the original expectations mean for the project?

I can’t speak for the Common Voice team, but you can find some information here:

As for LibriSpeech, the DeepSpeech team at Mozilla does use this data for training. However, the language is pretty antiquated, and we only get about 1K hours of data, whereas you need about 10K hours to get to a decent accuracy (WER of 10% and below). Common Voice is about adding to public corpora like LibriSpeech, not replacing them.

From this comment, I don’t think 10K hours is a short-term goal for Common Voice. As far as I understand it, the goal is to increase the overall volume of publicly available corpora, bringing it closer to 10K.

But someone from the Common Voice team may correct me if I’m mistaken!

Hi both, thanks for the discussion.

Indeed, our long-term goal is 10K hours for each language. That said, we may be able to build usable speech technology with far less data (for instance, by using transfer learning with our existing English models).

But in addition to 10K hours, we also want to start creating “usable datasets” in many languages that don’t have them yet. If you can build a V1 of a speech product in a language like Chuvash, you can use that product to start collecting more data; without any Chuvash data, it’s impossible to even start.

In addition to kickstarting languages, we are also trying to make our site more fun to use so that people donate more and come back more often. To look at some of this work, check out our upcoming wireframes:
http://bit.ly/cv-desktop-ux

In addition to kickstarting languages, we are also trying to make our site more fun to use so that people donate more and come back more often. To look at some of this work, check out our upcoming wireframes:

That seems like a really good idea. Thanks for sharing the mockup; I actually left a comment, hope that’s OK.

In terms of motivating users, I feel that more feedback on the user’s personal progress would help. For example, if you contribute speech, show how many of your clips have been validated; or when you validate clips, show how many of the clips you helped validate made it into the dataset.

The way it’s set up now, it feels like you’re making absolutely no impact on the total stats (which in a way is true, statistically speaking, but that makes it all the more important to emphasize the user’s individual contributions).

Yup, agreed, and we are working on exactly this issue right now. Stay tuned!
