Add Basque to the dataset page

txopi · March 30, 2019, 10:56am

Basque speakers are making recordings and validating them. We have already recorded (and validated) about 6 hours. Next week we will start the public recording phase of the project with a press conference and the first Basque Common Voice marathon. We are organizing more public sessions in other locations and other activities to boost the recordings…

Is it possible to make the Basque recordings downloadable like those of other languages? We see that languages with fewer recorded hours than Basque already have a downloadable dataset (Irish, Slovenian, Hakha Chin), and we would like to say publicly that Basque is fully launched on Common Voice.

nukeador · April 1, 2019, 11:50am

The languages currently available for download are the ones that had already launched and had validated hours before we started creating the dataset late last year (Basque and Spanish, for example, had not even launched by then).

Dataset releases are really time-consuming for the team, and we are still trying to figure out how to release them more often (I’ve added this to my to-do list to discuss with the team).

You can say with confidence that Basque is fully launched on Common Voice, but note that datasets are currently not released very often and that you have only just started collecting voices.

dabinat · April 1, 2019, 6:05pm

Just out of curiosity, what part of a dataset release is time-consuming? I would have thought it was largely if not entirely automatable. Is there some kind of manual validation process?

nukeador · April 1, 2019, 9:30pm

@kdavis and @josh_meyer can probably provide more details here; my understanding is that there was manual cleanup involved the last time we did it.

dabinat · April 1, 2019, 11:20pm

Yeah, I guess even with the validation process some bad ones are still going to slip through.

Have you thought about having a regularly scheduled pre-release, similar to Firefox’s nightly builds? Maybe a script runs once a week and packages them all up, then there are formal “stable” releases every few months. Those downloading the cutting-edge releases can help to flag up the problem clips.

kdavis · April 2, 2019, 8:06am

We had a pre-release last time. The community helped clean up some of the data, e.g. Welsh and German. However, the vast majority of the work was still on our shoulders.

nukeador · June 12, 2019, 11:17pm

Today we released a new version of the dataset, and we keep improving the automation of the process.
