Accessing the extended version of a dataset

Hello, I’m doing a text-to-speech project using a Belarusian dataset.

I’ve noticed that the “languages” page lists 790 validated hours, but the version available for download contains only 275 validated hours.

Being able to choose data from the additional 500 hours would make an enormous difference in the quality of my model.

So can I somehow access this extended corpus?

Thanks in advance! :slight_smile:

Hi @jhlfrfufyfn and welcome to the community :wave:

The next dataset release is expected in early January. It will include all the new data contributed since July. So just one more month of patience and you will get it all :smile:

Have a nice weekend,
Michael

Hello,
I thought the next release would be on the 15th of this month!
Regards,
Nart.

Hmm, there seems to be contradictory information.

In Weekly Update: 6th October 2021 - Next Dataset Release and Hacktoberfest @heyhillary said:

On https://github.com/common-voice/common-voice there’s this table:

Upcoming releases

| Type | Expected date | More info |
| --- | --- | --- |
| Platform code & sentences | Dec 15, 2021 | Release notes |
| Dataset | Jan 2022 | Dataset metadata |

I don’t know what “sentences” in the first line refers to (maybe localizations), but the second line seems to be the dataset release.

I don’t know which is right :man_shrugging:

Well, that’s a pity. My project deadline is December 13th, so I guess I’ll have to work with what I’ve got :slight_smile:
Thank you guys for the info, have a nice weekend too!

Sentences reported via CV Speak/Validate for errors.
Newly reviewed sentences via Sentence Collector.
Sentences from bulk submission(s).

Correcting the errors and packing the zip files also takes some time, I guess.

Thanks for the clarification!

I understand there’s quite a lot of manual work involved. According to Hillary’s quote above, this work would be done between the 10th and the 15th, which feels like a pretty short time span to me. We will see…

I think the dates on GitHub are more up to date; they were updated very recently. Also, Hillary’s statement was tentative.

That would mean more campaign time for us :grinning:

Hey everyone,

Our planned dataset release in December has to be rescheduled for January 2022.

We are welcoming three new engineers to the Mozilla Foundation in January, who will be dedicated to Common Voice. Until we bring on these new team members, we don’t have the capacity to release the dataset.

We apologize for the inconvenience this has caused, but expect that 2022 will be an exciting year, with lots of improvements, and more frequent releases.

Thank you,

Hillary & EM
