Validation Drive + Call for Collaboration: Ahead of the next dataset release

Hey Common Voice Community,

Happy New Month, and Happy Diwali to those who celebrate!

Voice Clip Validation Drive: Ahead of December 15th Dataset release

:postal_horn: The next dataset release is tentatively scheduled for December 15th 2021 :sparkles:, with the cut-off for voice clip contributions approximately 10 days before.

Validating voice clips supports the quality of the Common Voice Dataset, so applications created using the data can better understand people.

I would like to encourage you to create or collaborate on community activities that can help mobilise people to support voice clip validation. Feel free to use and adapt the resources from the social media campaign and community portal.

At the end of the month, I would also like to host validation parties :partying_face: to support the validation of voice clips. These will be open to the community. The dates are Monday 29th November, 6-7pm (UTC), and Thursday 2nd December, 2-3pm and 6-7pm (UTC). Registration will open soon.

You can now register to attend the Validation Parties via tito.

Contribute-a-thon call for collaborations - Tentatively Saturday 4th December

Last month we started hosting monthly sessions focusing on learning, sharing and creating with the Common Voice Dataset.

I’m reaching out to ask whether two language communities would be interested in collaborating on a community event focusing on either voice clip validation or sentence collection (if the language has not yet launched).

The events would tentatively take place on Saturday 4th December. If you would like to take part, please direct message me or respond in this thread.

For future Contribute-a-thons in 2022, I will be opening an expression of interest form next week for communities who would like to collaborate with me on a virtual community sprint/event.

Hello Hillary!
I wanted to ask whether December 15th is the exact release date. We need to be sure that we can download the Uzbek speech dataset before December 20th, as we have been planning a hackathon that requires this dataset (to create a valid and useful model with DeepSpeech).
It is the first hackathon of its kind in Uzbekistan, so we are working really hard on it. I would appreciate support from the Common Voice developer community.

We have almost 95 hours of recordings (only 60 hours of which are validated). I was wondering whether, if the release date is postponed by any chance, it would be possible to get the dataset for the Uzbek language only.