4200h Voice Dataset Release: More Than 4,200 Common Voice Hours Now Ready For Download

We hope you all had a great start to the New Year!

We’re very happy to announce the 4200h Voice Dataset Release*. Thanks to all of you, your hard work and amazing engagement, this dataset has 4,257 total hours of contributed voice data, which is an incredible 70% increase in total hours compared to June 2019! With the help of dedicated community contributors, we’ve scraped Wikipedia for sentence collection in the following languages, and the effort has paid off:

  • English - 1,488 hours recorded, 1,118 hours validated
  • German - 538 hours recorded, 483 hours validated
  • French - 412 hours recorded, 350 hours validated
  • Catalan - 295 hours recorded, 245 hours validated
  • Spanish - 221 hours recorded, 167 hours validated
  • Italian - 122 hours recorded, 85 hours validated

It’s not just the total number of hours that’s grown. This dataset includes voice recordings in 40 languages, featuring 11 new languages that have been added by our communities since June 2019: Abkhazian, Arabic, Chinese (Hong Kong), Indonesian, Interlingua, Japanese, Latvian, Portuguese, Romansh (Sursilvan), Tamil, and Votic. With ~259,000 contributors from around the world, Common Voice is more diverse than ever.

On behalf of the whole Voice team at Mozilla: Thank you all for your ongoing contributions, your support and creativity, your thoughtfulness, and your patience!

We would love to get your feedback on this new dataset. While our DeepSpeech team is currently running their own tests, we rely on the community to help us make future data collection even more valuable in terms of quality, diversity, and usage potential. Let us know what you’re working on, and please continue to share your findings with your peers and with our team here on Discourse and on the Common Voice Slack. (If you need more information about how the dataset is split, please refer to the Corpora Creator.) And in case you haven’t seen it already: our latest DeepSpeech version 0.6 includes a host of performance optimizations, making it easier for application developers to use the engine out of the box.
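If you want to take the new data for a spin with the engine, here is a minimal inference sketch against the DeepSpeech 0.6 Python package. The file names assume the unpacked 0.6 model archive, and the beam width and LM weights are the example defaults from the client, so treat all of them as starting points to adjust for your own setup:

```python
import wave

import numpy as np
from deepspeech import Model

# Paths assume the unpacked deepspeech-0.6.0-models archive; adjust as needed.
ds = Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)  # beam width 500
ds.enableDecoderWithLM(
    "deepspeech-0.6.0-models/lm.binary",  # KenLM language model
    "deepspeech-0.6.0-models/trie",       # trie built from the LM vocabulary
    0.75,  # lm_alpha (language model weight)
    1.85,  # lm_beta (word insertion weight)
)

# The engine expects 16-bit mono PCM at the model's sample rate (16 kHz for
# the released English model); resample your audio first if necessary.
with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```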

Thank you for helping advance the development of decentralized and open voice technologies, and we can’t wait to see what you come up with.

*This dataset was compiled on December 10th, 2019


And for Kabyle we have reached 315h until now :grinning:

I hope we can begin training as soon as possible as we are gathering texts for the language model from available sources.

Thanks for everything!


The release contains 262 validated hours :slight_smile:


Great news! The only slightly sad thing is that, since I thought the release would be in January, I advertised this project a lot in the last weeks to expand the number of validated hours. Since the release was already compiled on December 10th, 2019, all this work will only be usable next August or so. But it is still work that will help the project, we just can’t use it right now :slight_smile:


@lissyx
Does it seem OK to launch the first trained models?

Unfortunately, I might not have enough brain time these days.


I hear you: the time lag between contributing and being able to access the voice data is something we’re very aware of and thinking about. Thank you for being patient in the meantime :slight_smile:


I need time to set up my own workstation.

I want to download Indian English. Please advise: do I have to download the whole 38 GB English dataset and then filter out Indian English?


Yes, the metadata is included as part of the description file; we don’t have a separate file for each accent.
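If it helps, here is a minimal filtering sketch in Python with pandas. It assumes the standard Common Voice TSV layout, where validated.tsv has an accent column and English uses the label "indian"; double-check both against the files in your download:

```python
import pandas as pd

# Load the English release metadata (the TSV files are tab-separated).
df = pd.read_csv("en/validated.tsv", sep="\t")

# Keep only clips whose contributors self-reported an Indian English accent.
# "indian" is an assumption about the label; inspect df["accent"].unique()
# to see the exact values in your copy of the release.
indian = df[df["accent"] == "indian"]

print(f"{len(indian)} of {len(df)} validated English clips")
indian.to_csv("en/validated_indian.tsv", sep="\t", index=False)
```

The audio files named in the path column of the filtered TSV can then be copied or symlinked into a smaller working set, so you don’t need to keep the full 38 GB extracted.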

Thanks a lot. Can you please let me know how many hours of Indian English data are available?

According to the datasets page, 4% of the recordings have India and South Asia (India, Pakistan, Sri Lanka) accents; that’s around 59 hours out of the total 1,488.

That’s great, I will download the dataset and also donate Indian English voices. I hope I will also get the transcripts.

Great, let us know how you are using the data; we really want to know in order to optimize the dataset in the future! :slight_smile:

I am trying to build ASR using DeepSpeech for Indian English. What type of optimization are you referring to here? Please let me know.

We want to understand how people are using the data, in which applications, and what they would like to see in the dataset in the future.

Once you’ve had the chance to download and use the data, we’d appreciate some feedback and comments about the dataset and how useful it was.

Thanks!

I haven’t seen any articles about this release. Will there be a press release about it?

We’re considering this a soft launch for now. Due to the timing of the holidays, as well as Mozilla’s upcoming all-hands next week, it’s been a little difficult to coordinate all internal stakeholders, and we wanted to release the data ASAP. There will be an official announcement once all of that is sorted out. (cc @aklepel)


Thanks Jenny. @stergro, we’ll integrate the data release comms into the next larger announcement to create an impactful cycle, but it was important for us to get the data out into people’s hands.


Many languages have regional varieties, so I hope a second-level classification can be added for more of them; for example, Portuguese could be divided into Brazilian Portuguese and European Portuguese.