Common Voice 2021 Mid-year Dataset Release!

Dear Common Voice Community,

:partying_face: We are excited to announce the Common Voice 2021 Mid-year Dataset Release!

Your incredible contributions and community activities have made this latest version of the Common Voice Dataset possible. The dataset has grown to 13,905 hours and includes voice recordings in 76 languages, 16 of which are new to the platform and dataset. We’re excited to welcome Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani, and Hausa to the community.

Dataset Highlights!

  • The top five languages by total hours are English (2,630 hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and Esperanto (840).
  • Languages that have increased the most by percentage are Thai (almost 20x growth, from 12 hours to 250 hours), Luganda (almost 9x growth, from 8 to 80), Esperanto (7x growth, from 100 to 840), and Tamil (almost 8x, from 24 to 220).
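The multiples quoted above appear to line up with the relative increase, (new - old) / old, rather than the plain ratio new / old. That reading is an assumption based on the rounded hour figures in the list, not a documented method. A minimal sketch of the arithmetic:

```python
def growth_multiple(old_hours, new_hours):
    """Relative increase: how many times the original size was added.

    Assumption: this is how the "x growth" figures were derived; the
    announcement does not state the formula.
    """
    return (new_hours - old_hours) / old_hours


# Rounded hour figures copied from the highlights list (before, after).
highlights = {
    "Thai": (12, 250),
    "Luganda": (8, 80),
    "Esperanto": (100, 840),
    "Tamil": (24, 220),
}

for language, (before, after) in highlights.items():
    print(f"{language}: {growth_multiple(before, after):.1f}x growth")
```

Because the hours are rounded, the computed values only approximate the quoted multiples.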

Learn more about the release details and metadata on the Common Voice GitHub.

Dataset ‘Ask Me Anything’, 4th August

In celebration of the dataset release, we are hosting an Ask Me Anything discussion with our Lead Engineer, Jenny Zhang, on 4th August from 3 to 4 pm UTC. Jenny will be answering your questions live on Discourse. To join and ask a question, please use the AMA Discourse topic.

Common Voice 2021 Open Roadmap Sessions

We are excited to build on the amazing work of last year. Join the interactive Common Voice Open Roadmap session to learn about our plans for the next year and take part in important discussions about the future of the project.

Thank you for your continued support of the Common Voice mission to create a voice dataset that is publicly available to anyone and represents the world we live in.

Thank you, on behalf of the Common Voice Team
:sparkles:

Thank you and everyone on the team for this! Are there any diffs or changelogs between this and version 6.1? I am currently using English version 6.1 and would like to migrate to 7.0 with the minimum work required, since there are so many files.

Thanks again.

Hey @robertbracco1,

You are welcome, thanks so much for participating.

On our GitHub repo, you can access the changelogs here: https://github.com/common-voice/cv-dataset/blob/main/CHANGELOG.md
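If the changelog is not granular enough for a migration, a clip-level comparison between two releases can be sketched as below. This is only an illustration under stated assumptions: that each release ships a tab-separated file such as `validated.tsv`, that it has a `path` column naming the clip, and that a clip keeps the same file name across releases. Please check these against your own download before relying on it.

```python
import csv


def clip_paths(tsv_path, key="path"):
    """Return the set of clip names listed in one release's TSV file.

    Assumption: the file is tab-separated with a header row that
    contains a column named `key` ("path" by default).
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: sentences may contain literal quote characters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row[key] for row in reader}


def diff_releases(old_tsv, new_tsv, key="path"):
    """Compare two TSV files and report which clips were added, removed or kept."""
    old = clip_paths(old_tsv, key)
    new = clip_paths(new_tsv, key)
    return {"added": new - old, "removed": old - new, "kept": old & new}
```

Calling `diff_releases("cv-6.1/en/validated.tsv", "cv-7.0/en/validated.tsv")` (both paths are hypothetical) would then tell you which clips still need processing and which can be reused from your existing work.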

If you have any other questions, please do not hesitate to ask!

Hey, yesterday I talked to some folks from a local hackerspace about CV, and their reaction was one that I have heard a lot over the last year:

“What, Common Voice still exists? I thought this project had been stopped during the Mozilla layoffs last year.”

Maybe an official press release about the new dataset would be helpful; many people still think that this project is dead by now. I have also heard this a lot in online discussions.

Hey Stefan,

Thanks for sharing this with us.

In fact, a Mozilla Foundation blog post went out regarding the dataset release, and it will also be shared with our mailing list.

I can also pass on the suggestion about a press release, to ensure that contributors who have previously engaged with the project are aware of the changes.

Many people learned about Common Voice through the DeepSpeech team and don’t know that they are different projects. It also looks like Common Voice is the only project left from the previous voice.mozilla.org.

Hey Irvin,

Thanks for sharing this feedback. In response to your and Stefan’s points, I am looking into re-engaging with previous contributors as part of the community strategy.

For now, I have shared an update post on the DeepSpeech channel.

Absolutely. I believe collaborating with coqui.ai would really help the project. People from the old DeepSpeech team forked it and founded this company.

Collaboration with other systems, such as Vosk speech recognition, could be beneficial as well. Vosk was recently integrated into the video editor Kdenlive to create subtitles.

We don’t need a very deep collaboration, just some support for easy training of models. For example, creating compatible data formats and import scripts. People are already doing this, but these things are hard to find and hard to adapt.
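To make the idea of an import script concrete, here is a minimal sketch that turns Common Voice TSV rows into a JSON Lines manifest of audio path and transcript. The manifest layout (`audio` and `text` keys) and the `clips` directory name are hypothetical and are not the actual input format of Coqui, Vosk or any other toolkit; the `path` and `sentence` column names are assumptions about the dataset TSV files.

```python
import csv
import io
import json


def tsv_to_manifest(tsv_lines, clips_dir="clips"):
    """Yield one JSON line per clip from an iterable of TSV lines.

    Assumptions: the first line is a header with `path` and `sentence`
    columns; the output keys are made up for illustration.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        entry = {
            "audio": f"{clips_dir}/{row['path']}",
            "text": row["sentence"].strip(),
        }
        # ensure_ascii=False keeps non-Latin transcripts readable.
        yield json.dumps(entry, ensure_ascii=False)


# Tiny in-memory sample standing in for a real TSV file.
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "abc\tclip_0001.mp3\tHello world \n"
)
for line in tsv_to_manifest(sample):
    print(line)
```

A real importer would also need to handle audio conversion and each toolkit's own expectations, which is exactly the part that shared, easy-to-find scripts could cover.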