4200h Voice Dataset Release: More Than 4,200 Common Voice Hours Now Ready For Download

We hope you all had a great start to the New Year!

We’re very happy to announce the 4200h Voice Dataset Release*. Thanks to all of you, your hard work and amazing engagement, this dataset has 4,257 total hours of contributed voice data, which is an incredible 70% increase in total hours compared to June 2019! With the help of dedicated community contributors, we’ve scraped Wikipedia for sentence collection in the following languages, and the effort has paid off:

  • English - 1,488 hours recorded, 1,118 hours validated
  • German - 538 hours recorded, 483 hours validated
  • French - 412 hours recorded, 350 hours validated
  • Catalan - 295 hours recorded, 245 hours validated
  • Spanish - 221 hours recorded, 167 hours validated
  • Italian - 122 hours recorded, 85 hours validated

It’s not just the total number of hours that’s grown. This dataset includes voice recordings in 40 languages, featuring 11 new languages that have been added by our communities since June 2019: Abkhazian, Arabic, Chinese (Hong Kong), Indonesian, Interlingua, Japanese, Latvian, Portuguese, Romansh (Sursilvan), Tamil, and Votic. With ~259,000 contributors from around the world, Common Voice is more diverse than ever.

On behalf of the whole Voice team at Mozilla: Thank you all for your ongoing contributions, your support and creativity, your thoughtfulness, and your patience!

We would love to get your feedback on this new dataset. While our DeepSpeech team is currently running their own tests, we rely on the community to help us make future data collection even more valuable in terms of quality, diversity, and usage potential. Let us know what you’re working on, and please continue to share your findings with your peers and with our team here on Discourse and on the Common Voice Slack. (If you need more information about how the dataset is split, please refer to the Corpora Creator.) And in case you haven’t seen it already: our latest DeepSpeech version 0.6 includes a host of performance optimizations, making it easier for application developers to use the engine out of the box.
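If you want to take the new data for a spin with the engine, here is a minimal inference sketch against the DeepSpeech 0.6 Python package. The file names assume the unpacked 0.6 model archive, and the beam width and LM weights are the example defaults from the client, so treat all of them as starting points to adjust for your own setup:

```python
import wave

import numpy as np
from deepspeech import Model

# Paths assume the unpacked deepspeech-0.6.0-models archive; adjust as needed.
ds = Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)  # beam width 500
ds.enableDecoderWithLM(
    "deepspeech-0.6.0-models/lm.binary",  # KenLM language model
    "deepspeech-0.6.0-models/trie",       # trie built from the LM vocabulary
    0.75,  # lm_alpha (language model weight)
    1.85,  # lm_beta (word insertion weight)
)

# The engine expects 16-bit mono PCM at the model's sample rate (16 kHz for
# the released English model); resample your audio first if necessary.
with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```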

Thank you for helping advance the development of decentralized and open voice technologies, and we can’t wait to see what you come up with.

*This dataset was compiled on December 10th, 2019


And for Kabyle we have reached 315h until now :grinning:

I hope we can begin training as soon as possible as we are gathering texts for the language model from available sources.

Thanks for everything!


The release contains 262 validated hours :slight_smile:


Great news! The only slightly sad thing is that, since I thought the release would be in January, I advertised this project a lot in the last weeks to expand the number of validated hours. Since the release was already compiled on December 10th, 2019, all this work will only be usable next August or so. But it is still work that will help the project, we just can’t use it right now :slight_smile:


@lissyx
Does it seem OK to launch the first trained models?

Unfortunately, I might not have enough brain time these days.


I hear you: the time lag between contributing and being able to access the voice data is something we’re very aware of and thinking about. Thank you for being patient in the meantime :slight_smile:


I need time to set up my own workstation.

I want to download Indian English. Please advise: do I have to download the whole 38 GB English dataset and then filter out Indian English?


Yes, the metadata is included as part of the description file; we don’t have a separate file for each accent.
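If it helps, here is a minimal filtering sketch in Python with pandas. It assumes the standard Common Voice TSV layout, where validated.tsv has an accent column and English uses the label "indian"; double-check both against the files in your download:

```python
import pandas as pd

# Load the English release metadata (the TSV files are tab-separated).
df = pd.read_csv("en/validated.tsv", sep="\t")

# Keep only clips whose contributors self-reported an Indian English accent.
# "indian" is an assumption about the label; inspect df["accent"].unique()
# to see the exact values in your copy of the release.
indian = df[df["accent"] == "indian"]

print(f"{len(indian)} of {len(df)} validated English clips")
indian.to_csv("en/validated_indian.tsv", sep="\t", index=False)
```

The audio files named in the path column of the filtered TSV can then be copied or symlinked into a smaller working set, so you don’t need to keep the full 38 GB extracted.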

Thanks a lot. Can you please let me know how many hours of Indian English data are available?

According to the datasets page, 4% of the recordings have India and South Asia (India, Pakistan, Sri Lanka) accents; that’s around 59 hours out of the total 1,488.

That’s great, I will download the dataset and also donate Indian English voices. I hope I will also get the transcripts.

Great, let us know how you are using the data; we really want to know in order to optimize the dataset in the future! :slight_smile:

I am trying to build ASR using DeepSpeech for Indian English. What type of optimization are you referring to here? Please let me know.

We want to understand how people are using the data, in which applications, and what they would like to see in the dataset in the future.

Once you’ve had the chance to download and use the data, we’d appreciate some feedback and comments about the dataset and how useful it was.

Thanks!

I haven’t seen any articles about this release. Will there be a press release about it?

We’re considering this a soft launch for now. Due to the timing of the holidays, as well as Mozilla’s upcoming all-hands next week, it’s been a little difficult to coordinate all internal stakeholders, and we wanted to release the data ASAP. There will be an official announcement once all of that is sorted out. (cc @aklepel)


Thanks Jenny. @stergro, we’ll integrate the data release comms into the next larger announcement to create an impactful cycle, but it was important for us to get the data out into people’s hands.


Many languages have regional varieties, so I hope a second-level classification can be added for more of them; for example, Portuguese could be divided into Brazilian Portuguese and European Portuguese.