Common Voice mid-year release - more data, more languages!

r_LsdZVv67VKuK6fuHZ_tFpg · June 12, 2019, 1:51pm

The Common Voice Team is excited to announce the release of a new dataset that includes 2,366 total hours of contributed voice data!

The project has seen a spike in contributions and launches of many new languages over the past six months. We want to make sure to release data for use by the community quickly and efficiently. To do this, we’ve moved forward with a mid-year release including all recorded clips in 28 languages, available on the Datasets page on Common Voice.

The new languages being released today are Basque, Chinese (Simplified), Dhivehi, Estonian, Kinyarwanda, Mongolian, Russian, Sakha, Spanish, and Swedish. Some of these are the first-ever publicly available datasets for these languages.

We realize that research projects will need version identification and are handling this by language through our naming convention: language, total number of hours and date released.

<LOCALE>_<TOTAL_INCLUDING_UNVALIDATED_HOURS>h_<ISO_DATE>

e.g. en_1085h_2019-06-12
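For anyone who wants to track versions programmatically, here is a minimal Python sketch that splits a name of this form into its three parts. The function and class names are hypothetical, and it assumes the hour count is always a whole number and that the locale itself contains no underscore.

```python
import re
from datetime import date
from typing import NamedTuple


class ReleaseName(NamedTuple):
    locale: str
    total_hours: int  # includes unvalidated hours, per the naming convention
    released: date


# <LOCALE>_<TOTAL_INCLUDING_UNVALIDATED_HOURS>h_<ISO_DATE>
_RELEASE_PATTERN = re.compile(
    r"^(?P<locale>[^_]+)_(?P<hours>\d+)h_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_release_name(name: str) -> ReleaseName:
    """Split a release identifier into locale, total hours and release date."""
    match = _RELEASE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised release name: {name!r}")
    return ReleaseName(
        locale=match["locale"],
        total_hours=int(match["hours"]),
        released=date.fromisoformat(match["date"]),
    )


release = parse_release_name("en_1085h_2019-06-12")
print(release.locale, release.total_hours, release.released.isoformat())
```

Keeping the date as a `date` object rather than a string makes it easy to sort several releases of the same language chronologically.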

We look forward to your feedback and continued contribution as we collaborate to advance the development of open voice technologies.

As promised, we will soon be sharing for community input a more detailed proposal for a longer-term dataset strategy, which is likely to include a predictable data release cycle.

Finally, the whole Common Voice team wants to extend a hearty thank you to this great community and everyone who has contributed or validated voices.

nmstoker · June 12, 2019, 7:37pm

Is anyone else having problems with the checkboxes for the download?

I am using Chrome and went to the download page. I click the “Enter Email to Download” button and type in my email (which is a valid email), and then it autoscrolls me a bit too far down the page. When I go back up to select the checkboxes (“You are prepared to initiate a download of 30 GB” and “You agree to not attempt to determine the identity of speakers in the Common Voice dataset”), it won’t let me select them.

As a result the button is greyed out, so I can’t start the download.

Maybe I’m missing something else I need to do, but I don’t believe I am. It makes no difference if I let Chrome autopopulate the email address or type it manually.

Anything I should try?

nmstoker · June 12, 2019, 7:42pm

Something is up, and I’m not quite sure what I did to enable the button, but even though the checkboxes aren’t checked, the button stopped being disabled.

I don’t think it was triggered by me going into Inspection mode in the Chrome Dev tools but that’s when I noticed it.

Anyway, the download is proceeding now - thanks (again) for this great project!!

dabinat · June 12, 2019, 10:11pm

The last release was 1400 hours in February. So in less than four months, 1000 additional hours were added. Impressive!

parkerhasmail · June 13, 2019, 12:56pm

Any plans in the works for Mozilla themselves to use this data to make something for Firefox? Something like a voice assistant?

nukeador · June 13, 2019, 1:40pm

We definitely want to use DeepSpeech models in Mozilla products, but first we need more mature models that can provide the quality users expect from other commercial solutions.

minstrangeland · June 15, 2019, 5:48pm

Dear Common Voice developers and Mozilla community,

first of all, thank you very much for this initiative. It’s indeed a great contribution for small companies and researchers.

My question is related to how the deliveries of the same language are handled. One month ago I downloaded the German dataset, and today I saw that there was a new release and downloaded it too.

I was expecting that every new release would be like a superset of the previous, perhaps with some small changes here and there, but to my surprise I found that speaker ids and paths seem to be different in the new release.

Did I do anything wrong, or is this the way releases are handled, with new ids every time? It’s an awesome resource in any case, but knowing this would help me to decide how to process it on my side.

Thank you very much in advance!

nukeador · June 17, 2019, 12:23pm

@gregor might be able to provide an answer here.

minstrangeland · June 18, 2019, 6:24am

Thanks, Rubén.

How’s this list organized? Shall I contact Gregor and come back and post his answer, or will he reply directly here at some point?

minstrangeland · June 19, 2019, 12:30pm

Hi again!

Here are a couple of examples:

The recording d11bf898f22052314d40e3a9c33cc867658f84c2d94160b916d107ea2ecffbeb61cb064c651febdd36a62676ac8adf5bc8e1b7b8a055d0ea2c0ea249b1dc8ef7 from the first German corpus release seems to correspond to c7be2ec00185978f6f257065ad8a0eaa960d1e13c3c7e4446113583ddf0c2a0fe1df9a67ca40bf630300af7a74c8ad6e50f814e734c6a09b515eca216044d2ef in the new one (both of them have ‘b646cfa1d4d95022ddb8cde5cd902aa7’ as md5 hash).
Similarly, 51f49c0876669fef47709942270773c84f5e22e900219b0d8f178b150281cb970037fc959628ecfe60bc12eba8cc423575ff8148f448f4821f6497c175a9cad1 in the first release seems to correspond to 72e4ab2cb4c003fecb628b1dba05c1edd235c9baa3178db0b4dbf54b50fad926e45a4f379078653520d1d23c4c183c52363993db5ed280e7bf0d1a2d7b0bc14c in the second one (‘27c6560188c3abf6c8f2e8cf5db0fd9a’ hash).
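A correspondence like this can be found by hashing every clip in both releases and pairing the files whose MD5 digests agree. The following Python sketch shows one way to do it; the function names are hypothetical, and it assumes each release has been unpacked into a directory of `.mp3` files and that the audio bytes of a clip are unchanged between releases.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_md5(clips_dir: Path) -> Dict[str, Path]:
    """Map digest -> clip path. If two clips share a digest, the last one wins."""
    return {md5_of_file(path): path for path in sorted(clips_dir.glob("*.mp3"))}


def match_releases(old_dir: Path, new_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair up clips that have byte-identical audio in both releases."""
    old_index = index_by_md5(old_dir)
    new_index = index_by_md5(new_dir)
    shared = sorted(old_index.keys() & new_index.keys())
    return [(old_index[digest], new_index[digest]) for digest in shared]
```

Calling `match_releases(Path("release_old/clips"), Path("release_new/clips"))` (both paths are placeholders) returns a list of `(old_path, new_path)` pairs, which is enough to translate ids from one release to the other.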

Moreover, speaker ids are not only different in both releases: in the first one both recordings are assigned to two different speaker ids (ea796714c86616b68df08a47cdc23c4302ee43a19c3b85098092787a0b897f8a4851208ddb4ee4cde150e7cc50742f1f2f72fee38a62d4a9550e95c3a77a0abc and a138994876901418bbe9f2f1e6770ff07ce63828be2375f846b5e0d323a9cbde4bf74bd58250b03b133d04e32be404ee43aa97db9a0e45acdddfd0511d654caf), whereas in the second release both of them are assigned to the same speaker (701fa16b01f2b55c04f0f2e485445ad55499fc8a4f4d000dfd0236ec77df73e2552fa8a65feb6ed03e1725e7a075d389edd2ea139d3367072ce588bb37c0a87f).

Can we also expect speaker id corrections in each new release, or is there any processing issue? I listened to the recordings, but unfortunately the second one is so muffled that I’m not able to discern whether it’s the same speaker or not.

If anyone could clarify these issues, I would really appreciate it…

Thank you very much in advance!

nukeador · June 19, 2019, 3:26pm

@gregor will comment here, thanks.

gregor · June 19, 2019, 3:34pm

Heya, thanks for looking into this. They were indeed supposed to be consistent. I’ll investigate next week and possibly issue a re-release (after speaking with the team).

Recordings from the same account should indeed have the same client_id as well.

minstrangeland · June 20, 2019, 7:30am

Thank you very much, Gregor

Since we are already on the topic of the ids, I guess there’s a strong reason to choose these long sausage-like identifiers, but for some applications something like “common_voice_${language}_${utt|spk}_${index}”, for recording and speaker ids respectively, would be handier. Debugging and troubleshooting would be much easier that way.

Just my opinion, of course…

gregor · June 24, 2019, 7:24pm

Thanks for the input!

I for some reason thought we didn’t have a stable numeric index when I wrote the bundler. We actually do! So in the next release (and in all the following ones) it’ll be the format you recommended, minus the utt|spk part. Will probably also issue a re-release very soon, but I’ll have to check with the team again to make sure we get it right.
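As a rough illustration of what the new naming makes possible, here is a small Python sketch that builds and parses clip names. It assumes the names look like common_voice_de_17798603.mp3 (an example that appears later in this thread), with the language code and the numeric index separated by underscores; the helper names are hypothetical.

```python
import re
from typing import Tuple

# common_voice_<language>_<index>, with an optional .mp3 extension
_CLIP_NAME = re.compile(
    r"^common_voice_(?P<language>[A-Za-z-]+)_(?P<index>\d+)(?:\.mp3)?$"
)


def make_clip_name(language: str, index: int) -> str:
    """Build a clip file name from a language code and a numeric index."""
    return f"common_voice_{language}_{index}.mp3"


def parse_clip_name(name: str) -> Tuple[str, int]:
    """Return (language, index) for a clip name, or raise ValueError."""
    match = _CLIP_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised clip name: {name!r}")
    return match["language"], int(match["index"])


language, index = parse_clip_name("common_voice_de_17798603.mp3")
print(language, index)
```

Because the index is a plain integer, clips can be sorted or looked up without carrying the long hash-style identifiers around.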

minstrangeland · June 25, 2019, 6:27am

Wow, I have mini-contributed!

Excellent, thank you very much.

I guess it’s possible that some speakers will contribute in several languages. If that’s the case, a unique speaker id for the whole database (independent from the language) might be more suitable. For example, it would be very useful for cross-language speaker verification experiments.

gregor · June 26, 2019, 9:25pm

The “new release” is now up on the site: https://voice.mozilla.org/datasets

Changelog:

Includes Farsi (/Persian)
new file naming

Speakers should indeed be (anonymously) identifiable across languages already.

minstrangeland · June 27, 2019, 1:07pm

Thank you very much for the new release, Gregor. Now the utterance ids definitely look more manageable.

I’m still finding inconsistencies in the speaker ids between the current German release and the original one (2019-05-10):

In the new release, the utterances common_voice_de_17798603.mp3 (89387bc72f57ca7676ee1deb549d4444 hash) and common_voice_de_17799805.mp3 (b646cfa1d4d95022ddb8cde5cd902aa7 hash) seem to share the same speaker id: b38b88f0aa5821ad43960f48955ffef20324a589315f9c4e81687154da0246c7617784b3c28a4af21eb98a7413713ac1a7add9e3f3b6a14906796780e9126d44
In the 2019-05-10 release, the corresponding recordings d6d3fc0ee3d49de9cf55191f24a7b87a0cb9b600386c149c54957c101f49bf7311e9cb6c610d16c7cd2b9b0e2c3bb526f03ae91420c0ab85a5e08f81029a3f28.mp3 (89387bc72f57ca7676ee1deb549d4444 hash) and d11bf898f22052314d40e3a9c33cc867658f84c2d94160b916d107ea2ecffbeb61cb064c651febdd36a62676ac8adf5bc8e1b7b8a055d0ea2c0ea249b1dc8ef7.mp3 (b646cfa1d4d95022ddb8cde5cd902aa7 hash) have two different speaker ids: a138994876901418bbe9f2f1e6770ff07ce63828be2375f846b5e0d323a9cbde4bf74bd58250b03b133d04e32be404ee43aa97db9a0e45acdddfd0511d654caf and ea796714c86616b68df08a47cdc23c4302ee43a19c3b85098092787a0b897f8a4851208ddb4ee4cde150e7cc50742f1f2f72fee38a62d4a9550e95c3a77a0abc (respectively).
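A check like this can be automated once each release has been reduced to a mapping from a clip's MD5 hash to its speaker id. The Python sketch below is one possible way to do that, not a tool from the project: it reports speakers that were split into several ids and speakers that were merged into one. The function name is hypothetical, and the short speaker labels stand in for the long ids quoted above.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def speaker_id_conflicts(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Compare speaker groupings for clips present in both releases.

    `old` and `new` map clip MD5 -> speaker id. Returns (split, merged):
    split maps an old id to several new ids, merged maps a new id to
    several old ids. Both are empty when the groupings agree.
    """
    old_to_new: Dict[str, Set[str]] = defaultdict(set)
    new_to_old: Dict[str, Set[str]] = defaultdict(set)
    for digest in old.keys() & new.keys():
        old_to_new[old[digest]].add(new[digest])
        new_to_old[new[digest]].add(old[digest])
    split = {spk: ids for spk, ids in old_to_new.items() if len(ids) > 1}
    merged = {spk: ids for spk, ids in new_to_old.items() if len(ids) > 1}
    return split, merged


# The two clips discussed above; speaker labels are placeholders for the long ids.
old_release = {
    "89387bc72f57ca7676ee1deb549d4444": "old_speaker_1",
    "b646cfa1d4d95022ddb8cde5cd902aa7": "old_speaker_2",
}
new_release = {
    "89387bc72f57ca7676ee1deb549d4444": "new_speaker_1",
    "b646cfa1d4d95022ddb8cde5cd902aa7": "new_speaker_1",
}
split, merged = speaker_id_conflicts(old_release, new_release)
print("split:", split)
print("merged:", merged)
```

A pure renaming of speaker ids between releases produces no conflicts here; only changes in how clips are grouped are reported, which is the case described in this post.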

BTW, this message is a great example of how choosing simpler ids can make certain stuff much easier

Thank you very much in advance…

dowwie · July 10, 2019, 12:51pm

What criteria are you using to determine the “good enough” threshold required for production use of voice related models in development here for Mozilla products? I’m not familiar with how to evaluate the efficacy of models so a rubric that can help assess models would be valuable! Thoughts?

nukeador · July 10, 2019, 1:53pm

I’m not involved in these evaluations, but I suspect we don’t want to use a model that provides users a noticeable decrease of quality from the previous service used.

@george is currently thinking through this and Mozilla’s overall voice strategy.

minstrangeland · August 12, 2019, 9:48am

@gregor, is there any comment about this apparent inconsistency in the speaker ids between releases? Shall we just take the last release and forget about the previous ones?

Thank you very much again
