Common Voice mid-year release - more data, more languages!

The Common Voice Team is excited to announce the release of a new dataset that includes 2,366 total hours of contributed voice data!

The project has seen a spike in contributions and launches of many new languages over the past six months. We want to make sure to release data for use by the community quickly and efficiently. To do this, we’ve moved forward with a mid-year release including all recorded clips in 28 languages, available on the Datasets page on Common Voice.

The new languages being released today are Basque, Chinese (Simplified), Dhivehi, Estonian, Kinyarwanda, Mongolian, Russian, Sakha, Spanish, and Swedish. Some of these are the first ever publicly available datasets for these languages.

We realize that research projects will need version identification and are handling this by language through our naming convention: language, total number of hours and date released.


e.g. en_1085h_2019-06-12
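For anyone who wants to track versions programmatically, here is a minimal Python sketch that splits an identifier of this shape into its three parts. The regular expression, the `DatasetVersion` type, and the `parse_version` helper are illustrative assumptions based only on the example above, not an official parser.

```python
import re
from datetime import date
from typing import NamedTuple


class DatasetVersion(NamedTuple):
    """Hypothetical container for the three parts of a version identifier."""
    language: str
    hours: int
    released: date


# Assumed shape: <language>_<hours>h_<YYYY-MM-DD>, e.g. en_1085h_2019-06-12.
# Hyphens are allowed in the language part to cover codes such as zh-CN.
_VERSION_RE = re.compile(
    r"^(?P<lang>[A-Za-z-]+)_(?P<hours>\d+)h_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_version(identifier: str) -> DatasetVersion:
    """Split a dataset identifier into language, total hours and release date."""
    match = _VERSION_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised dataset identifier: {identifier!r}")
    return DatasetVersion(
        language=match["lang"],
        hours=int(match["hours"]),
        released=date.fromisoformat(match["date"]),
    )


version = parse_version("en_1085h_2019-06-12")
print(version.language, version.hours, version.released.isoformat())
```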

We look forward to your feedback and continued contribution as we collaborate to advance the development of open voice technologies.

As promised, we will soon share a more detailed proposal for a longer-term dataset strategy for community input. It is likely to include a predictable data release cycle.

Finally, the whole Common Voice team wants to extend a hearty thank you to this great community and everyone who has contributed or validated voices.


Is anyone else having problems with the checkboxes for the download?

I am using Chrome. On the download page, I click the “Enter Email to Download” button and type in my email (which is valid). The page then autoscrolls a bit too far down, and when I scroll back up to select the checkboxes (“You are prepared to initiate a download of 30 GB” and “You agree to not attempt to determine the identity of speakers in the Common Voice dataset”), it won’t let me select them.

As a result the button is greyed out, so I can’t start the download.

Maybe I’m missing something else I need to do, but I don’t believe I am. It makes no difference if I let Chrome autopopulate the email address or type it manually.

Anything I should try?

Something is up and I’m not quite sure what action I did to enable the button, but even though the checkboxes aren’t checked the button stopped being disabled.

I don’t think it was triggered by me going into Inspection mode in the Chrome Dev tools but that’s when I noticed it.

Anyway, the download is proceeding now - thanks (again) for this great project!!

The last release was 1400 hours in February. So in less than four months, 1000 additional hours were added. Impressive!


Any plans in the works for Mozilla themselves to use this data to make something for Firefox? Something like a voice assistant?

We definitely want to use DeepSpeech models in Mozilla products, but first we need more mature models that can provide the quality users expect from other commercial solutions.

Dear Common Voice developers and Mozilla community,

first of all, thank you very much for this initiative. It’s indeed a great contribution for small companies and researchers.

My question is related to how the deliveries of the same language are handled. One month ago I downloaded the German dataset, and today I saw that there was a new release and downloaded it too.

I was expecting that every new release would be like a superset of the previous, perhaps with some small changes here and there, but to my surprise I found that speaker ids and paths seem to be different in the new release.

Did I do anything wrong, or is this the way releases are handled, with new ids every time? It’s an awesome resource in any case, but knowing this would help me to decide how to process it on my side.

Thank you very much in advance!

@gweber might be able to provide an answer here.

Thanks, Rubén :slight_smile:

How’s this list organized? Shall I contact Gregor and come back and post his answer, or will he reply directly here at some point?

Hi again! :slight_smile:

Here are a couple of examples:

  • The recording d11bf898f22052314d40e3a9c33cc867658f84c2d94160b916d107ea2ecffbeb61cb064c651febdd36a62676ac8adf5bc8e1b7b8a055d0ea2c0ea249b1dc8ef7 from the first German corpus release seems to correspond to c7be2ec00185978f6f257065ad8a0eaa960d1e13c3c7e4446113583ddf0c2a0fe1df9a67ca40bf630300af7a74c8ad6e50f814e734c6a09b515eca216044d2ef in the new one (both of them have ‘b646cfa1d4d95022ddb8cde5cd902aa7’ as md5 hash).
  • Similarly, 51f49c0876669fef47709942270773c84f5e22e900219b0d8f178b150281cb970037fc959628ecfe60bc12eba8cc423575ff8148f448f4821f6497c175a9cad1 in the first release seems to correspond to 72e4ab2cb4c003fecb628b1dba05c1edd235c9baa3178db0b4dbf54b50fad926e45a4f379078653520d1d23c4c183c52363993db5ed280e7bf0d1a2d7b0bc14c in the second one (‘27c6560188c3abf6c8f2e8cf5db0fd9a’ hash).
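A comparison like the one above can be automated by hashing the audio files of both releases and pairing up identical digests. The sketch below is only an illustration: the directory layout, the `*.mp3` pattern, and the helper names are assumptions, and it only pairs clips whose bytes are identical in both releases.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_md5(clips_dir: Path, pattern: str = "*.mp3") -> Dict[str, str]:
    """Map MD5 digest -> clip name (file stem) for every clip in a directory.

    If two clips in the same directory have identical content, only the
    last one (in sorted order) is kept.
    """
    return {md5_of_file(p): p.stem for p in sorted(clips_dir.glob(pattern))}


def match_releases(old_dir: Path, new_dir: Path) -> List[Tuple[str, str]]:
    """Pair up (old_name, new_name) for clips whose content is byte-identical."""
    old_index = index_by_md5(old_dir)
    new_index = index_by_md5(new_dir)
    shared = sorted(old_index.keys() & new_index.keys())
    return [(old_index[h], new_index[h]) for h in shared]
```

Calling `match_releases(Path("release_1/clips"), Path("release_2/clips"))` (both paths hypothetical) would return a list of old-name/new-name pairs that can then be used to translate ids between the two releases.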

Moreover, the speaker ids don’t just differ between the two releases: in the first one the two recordings are assigned to two different speaker ids (ea796714c86616b68df08a47cdc23c4302ee43a19c3b85098092787a0b897f8a4851208ddb4ee4cde150e7cc50742f1f2f72fee38a62d4a9550e95c3a77a0abc and a138994876901418bbe9f2f1e6770ff07ce63828be2375f846b5e0d323a9cbde4bf74bd58250b03b133d04e32be404ee43aa97db9a0e45acdddfd0511d654caf), whereas in the second release both are assigned to the same speaker (701fa16b01f2b55c04f0f2e485445ad55499fc8a4f4d000dfd0236ec77df73e2552fa8a65feb6ed03e1725e7a075d389edd2ea139d3367072ce588bb37c0a87f).
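To find every case like this rather than spotting them by hand, one could check how speaker ids map onto each other across releases. This is a rough sketch under stated assumptions: `matches` is a list of (old clip, new clip) pairs already known to be the same recording, and the two dictionaries map clip names to speaker ids as read from each release's metadata. All names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def speaker_mapping_conflicts(
    matches: Iterable[Tuple[str, str]],
    old_speaker: Dict[str, str],
    new_speaker: Dict[str, str],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Report speakers whose recordings are grouped differently across releases.

    Returns (split, merged):
      split  - old speaker id -> several new speaker ids it was spread over
      merged - new speaker id -> several old speaker ids it absorbed
    """
    old_to_new: Dict[str, Set[str]] = defaultdict(set)
    new_to_old: Dict[str, Set[str]] = defaultdict(set)
    for old_clip, new_clip in matches:
        old_id = old_speaker[old_clip]
        new_id = new_speaker[new_clip]
        old_to_new[old_id].add(new_id)
        new_to_old[new_id].add(old_id)
    split = {o: ns for o, ns in old_to_new.items() if len(ns) > 1}
    merged = {n: os_ for n, os_ in new_to_old.items() if len(os_) > 1}
    return split, merged
```

In the situation described above, the two old speaker ids would show up together under a single new id in `merged`, while `split` would stay empty for them.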

Can we also expect speaker id corrections in each new release, or is this a processing issue? I listened to the recordings, but unfortunately the second one is so muffled that I can’t tell whether it’s the same speaker or not.

If anyone could clarify these issues, I would really appreciate it…

Thank you very much in advance!

@gweber will comment here, thanks.

Heya, thanks for looking into this. They were indeed supposed to be consistent. I’ll investigate next week and possibly issue a re-release (after speaking with the team).

Recordings from the same account should indeed have the same client_id as well.

Thank you very much, Gregor :slight_smile:

Since we are already on the topic of ids: I guess there’s a strong reason for choosing these long sausage-like identifiers, but for some applications something like “common_voice_${language}${utt|spk}${index}”, for recording and speaker ids respectively, would be handier. Debugging and troubleshooting would be much easier that way.

Just my opinion, of course… :slight_smile:

Thanks for the input! :slight_smile:

I for some reason thought we didn’t have a stable numeric index when I wrote the bundler. We actually do! So in the next release (and in all the following ones) it’ll be the format you recommended, minus the utt|spk part. Will probably also issue a re-release very soon, but I’ll have to check with the team again to make sure we get it right.
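As a rough illustration of what such an id could look like, here is a tiny Python sketch. The underscore separators and the `readable_id` helper are assumptions made for the example; the exact format of the actual release may differ.

```python
def readable_id(language: str, index: int) -> str:
    """Build a short, human-readable id from a language code and a numeric index.

    Assumed format: common_voice_<language>_<index> (separators are a guess).
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    return f"common_voice_{language}_{index}"


print(readable_id("de", 42))
```

Ids of this kind are easy to grep for and to sort, which is what makes them convenient for debugging compared with long hashes.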

Wow, I have mini-contributed! :smiley:

Excellent, thank you very much.

I guess it’s possible that some speakers will contribute in several languages. If that’s the case, a unique speaker id for the whole database (independent from the language) might be more suitable. For example, it would be very useful for cross-language speaker verification experiments.