We hope you all had a great start to the New Year!
We’re very happy to announce the 4200h Voice Dataset Release*. Thanks to your hard work and amazing engagement, this dataset contains 4,257 total hours of contributed voice data, an incredible 70% increase in total hours compared to June 2019! With the help of dedicated community contributors, we’ve scraped Wikipedia to collect sentences in the following languages, and the effort has paid off:
- English - 1,488 hours recorded, 1,118 hours validated
- German - 538 hours recorded, 483 hours validated
- French - 412 hours recorded, 350 hours validated
- Catalan - 295 hours recorded, 245 hours validated
- Spanish - 221 hours recorded, 167 hours validated
- Italian - 122 hours recorded, 85 hours validated
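For anyone who wants to compare how much of each language's recorded audio has been validated so far, here is a minimal Python sketch that uses only the six figures listed above. The names `hours`, `validated_share`, and `shares` are illustrative and are not part of any Common Voice tooling.

```python
# Hours per language as listed above: (recorded, validated).
hours = {
    "English": (1488, 1118),
    "German": (538, 483),
    "French": (412, 350),
    "Catalan": (295, 245),
    "Spanish": (221, 167),
    "Italian": (122, 85),
}


def validated_share(recorded, validated):
    """Return validated hours as a percentage of recorded hours."""
    return 100.0 * validated / recorded


# Validated share for each language, keyed by language name.
shares = {lang: validated_share(r, v) for lang, (r, v) in hours.items()}

# Print languages from the highest to the lowest validated share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {share:.1f}% validated")
```

A lower share simply means more recorded clips are still waiting for review, so it can be a rough hint about where validation help is most useful.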
It’s not just the total number of hours that’s grown. This dataset includes voice recordings in 40 languages, featuring 11 new languages that have been added by our communities since June 2019: Abkhazian, Arabic, Chinese (Hong Kong), Indonesian, Interlingua, Japanese, Latvian, Portuguese, Romansh (Sursilvan), Tamil, and Votic. With ~259,000 contributors from around the world, Common Voice is more diverse than ever.
On behalf of the whole Voice team at Mozilla: Thank you all for your ongoing contributions, your support and creativity, your thoughtfulness, and your patience!
We would love to get your feedback on this new dataset. While our DeepSpeech team is currently running their own tests, we rely on the community to help us make future data collection even more valuable in terms of quality, diversity, and usage potential. Let us know what you’re working on, and please continue to share your findings with your peers and with our team here on Discourse and on the Common Voice Slack. (If you need more information about how the dataset is split, please refer to the Corpora Creator.)
And in case you haven’t seen it already: our latest DeepSpeech version 0.6 includes a host of performance optimizations, making it easier for application developers to use the engine out of the box.
Thank you for helping advance the development of decentralized and open voice technologies, and we can’t wait to see what you come up with.
*This dataset was compiled on December 10th, 2019.