The multi-language dataset is now available to the Common Voice community as a beta release! This release includes all new, multi-language data that has been collected in 2018. There are two reasons for choosing a community-focused beta release. First, the data in this release is raw. The Common Vo…

Hello everyone, the new dataset is back up and ready for use!

Hi, thanks for your hard work on making this release happen! I imagine you’re getting tons of requests for access to the voice data right now. Is giving access a manual process (i.e., giving access only after review of each form submission by a human)?

Hi there. If you would like access to the voice data, you can fill out the form above; we send out a link via email at the beginning of each day. Each of the sentences you hear has been reviewed by 2 humans to ensure its correctness. Does that answer your question?

Thanks, that fully answers my question. I filled out the form above earlier today (shortly after your post), but haven’t received any email, so I was just wondering if it had to be approved first. Again, thanks for your work on this and I’m looking forward to working with the data!

You should be getting your email shortly!

Hi @r_LsdZVv67VKuK6fuHZ_tFpg , I’ve found a broken audio file in the zh-TW dataset: 53777c75a47473ca6101ac395e74d3a8e9b66f2ad58ce3d7defc1a22761f5f0b7072ddf8d62fd06be02a4843587ea1322c29f90b61edf99cc608981306dc35e4.mp3 This clip is listed in other.tsv. Finally, thanks for the release.

Thanks for the heads up, @areyliu6! @gregor, do you need any further information about the broken file? Let’s review it in the next sprint meeting.

I downloaded the kab dataset but I can’t find the transcripts (sentences); I only got the audio files. I’m going to train a first model on the dataset using DeepSpeech and demo it at an event we are organizing to show the importance of the project and recruit more recorders from Kabylia. Thanks again for the release.

The sentences are in the clips.tsv file. If you want to get them split up by language, validity & bucket, you need to run the CorporaCreator on the file.
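If you only need a quick look at the transcripts before running the CorporaCreator, something like the sketch below may help. It is not part of any official tooling, and it assumes the file is tab-separated with a header row containing columns named `path`, `sentence` and `locale`; those column names and the helper `sentences_by_locale` are assumptions for illustration, so check the header of your own clips.tsv first.

```python
import csv
import io
from collections import defaultdict


def sentences_by_locale(tsv_file, path_col="path", sentence_col="sentence",
                        locale_col="locale"):
    """Map each locale to a list of (clip path, sentence) pairs.

    The column names are assumptions; pass different ones if the
    header of your file uses other names.
    """
    grouped = defaultdict(list)
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        grouped[row[locale_col]].append((row[path_col], row[sentence_col]))
    return dict(grouped)


# Tiny in-memory stand-in for clips.tsv; real rows and columns may differ.
sample = io.StringIO(
    "path\tsentence\tlocale\n"
    "a.mp3\tAzul fell-awen\tkab\n"
    "b.mp3\tHello world\ten\n"
    "c.mp3\tTanemmirt\tkab\n"
)
transcripts = sentences_by_locale(sample)
```

For a real file you would pass `open("clips.tsv", newline="", encoding="utf-8")` instead of the in-memory sample. This only groups by language; it does not replace the CorporaCreator, which is still what you need for the split by validity and bucket.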

Thanks again for the help :heart_eyes::heart_eyes::heart_eyes: wonderful

Multi-language Dataset Beta Release

Common Voice

nukeador (Rubén Martín [❌ taking a break from Mozilla]) June 12, 2019, 11:18pm 22

Today we released a new version of the dataset, and we keep improving the automation of the process.
