Dear Common Voice Community,
We are excited to announce the Common Voice 2021 Mid-year Dataset Release!
Your incredible contributions and community activities have made this latest version of the Common Voice Dataset possible. The dataset has grown to 13,905 hours and includes voice recordings in 76 languages, 16 of which are new to the platform and dataset. We’re excited to welcome Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani, and Hausa to the community.
- The top five languages by total hours are English (2,630 hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and Esperanto (840).
- Languages that have increased the most by percentage are Thai (almost 20x growth, from 12 hours to 250 hours), Luganda (almost 9x growth, from 8 to 80), Tamil (almost 8x growth, from 24 to 220), and Esperanto (7x growth, from 100 to 840).
Learn more about the release details and metadata on the Common Voice GitHub.
Dataset ‘Ask Me Anything’, 4th August
To celebrate the dataset release, we are hosting an Ask Me Anything discussion with our Lead Engineer, Jenny Zhang, on 4th August from 3 to 4 pm UTC. Jenny will answer your questions live on Discourse. To join and ask a question, please use the following AMA Discourse topic.
Common Voice 2021 Open Roadmap Sessions
We are excited to build on the amazing work of last year. Join the interactive Common Voice Open Roadmap session to learn more about our plans for the next year and to take part in important discussions about the future.
Thank you for your continued support of the Common Voice mission to create a voice dataset that is publicly available to anyone and represents the world we live in.
Thank you, on behalf of the Common Voice Team