The Common Voice Team is excited to announce the release of a new dataset that includes 2,366 total hours of contributed voice data!
The project has seen a spike in contributions and launches of many new languages over the past six months. We want to make sure to release data for use by the community quickly and efficiently. To do this, we’ve moved forward with a mid-year release including all recorded clips in 28 languages, available on the Datasets page on Common Voice.
The new languages being released today are Basque, Chinese (Simplified), Dhivehi, Estonian, Kinyarwanda, Mongolian, Russian, Sakha, Spanish, and Swedish. Some of these are the first publicly available datasets ever released for their languages.
We realize that research projects need version identification. We handle this per language through our naming convention: locale, total number of hours (including unvalidated clips), and release date.
<LOCALE>_<TOTAL_INCLUDING_UNVALIDATED_HOURS>h_<ISO_DATE>
e.g. en_1085h_2019-06-12
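For anyone who wants to record the dataset version in a script or experiment log, here is a minimal Python sketch that splits an identifier of this form into its three parts. The names `DatasetVersion` and `parse_version_id` are hypothetical and not part of any Common Voice tooling, and the pattern assumes locale codes consist of letters with optional hyphenated subtags (for example `en` or `zh-CN`).

```python
import re
from datetime import date
from typing import NamedTuple


class DatasetVersion(NamedTuple):
    """Parts of a version identifier (hypothetical helper type)."""
    locale: str
    total_hours: int  # total hours, including unvalidated clips
    released: date


# Assumed shape: <LOCALE>_<HOURS>h_<YYYY-MM-DD>; the locale pattern is a guess.
_VERSION_RE = re.compile(
    r"^(?P<locale>[A-Za-z]+(?:-[A-Za-z0-9]+)*)"
    r"_(?P<hours>\d+)h"
    r"_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_version_id(identifier: str) -> DatasetVersion:
    """Split an identifier such as 'en_1085h_2019-06-12' into its fields."""
    match = _VERSION_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a recognized version identifier: {identifier!r}")
    return DatasetVersion(
        locale=match["locale"],
        total_hours=int(match["hours"]),
        # fromisoformat also rejects impossible dates such as 2019-13-01
        released=date.fromisoformat(match["date"]),
    )
```

Calling `parse_version_id("en_1085h_2019-06-12")` yields the locale `en`, 1085 total hours, and a release date of 12 June 2019; anything that does not match the convention raises a `ValueError`.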
We look forward to your feedback and continued contribution as we collaborate to advance the development of open voice technologies.
As promised, we will soon share a more detailed proposal for a longer-term dataset strategy and invite community input on it. The proposal is likely to include a predictable data release cycle.
Finally, the whole Common Voice team wants to extend a hearty thank you to this great community and to everyone who has contributed or validated voice recordings.