Add Basque to the dataset page

Basque speakers are making recordings and validating them. We have already recorded (and validated) about 6 hours. Next week we will start the public recording phase of the project with a press conference and the first Basque Common Voice marathon. We are organizing more public sessions in other locations and other activities to boost the recordings…

Is it possible to make Basque recordings downloadable like other languages? We see that languages with fewer recorded hours than Basque have a downloadable dataset (Irish, Slovenian, Hakha Chin), and we would like to say publicly that Basque is fully launched on Common Voice.

The languages currently available for download are the ones that had already launched and had validated hours before we started creating the dataset late last year (Basque and Spanish, for example, had not even launched by then).

Dataset releases are very time-consuming for the team, and we are still trying to figure out how to release them more often (I’ve added this as a todo on my list to discuss with the team).

You can say with confidence that Basque is fully launched on Common Voice, but that datasets are currently not released very often and that you have only just started collecting voices.

Just out of curiosity, what part of a dataset release is time-consuming? I would have thought it was largely if not entirely automatable. Is there some kind of manual validation process?

@kdavis and @josh_meyer can probably provide more details here; my understanding is that there was manual cleanup involved last time we did it.

Yeah, I guess even with the validation process some bad ones are still going to slip through.

Have you thought about having a regularly scheduled pre-release, similar to Firefox’s nightly builds? Maybe a script runs once a week and packages them all up, then there are formal “stable” releases every few months. Those downloading the cutting-edge releases can help to flag up the problem clips.
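To make the idea concrete, here is a rough sketch of what such a weekly packaging script might look like. It is only an illustration, not how the Common Voice team actually builds releases: the function names, the flat folder of `.mp3` clips, and the archive naming scheme are all assumptions of mine.

```python
import tarfile
from datetime import date
from pathlib import Path


def prerelease_name(language: str, day: date) -> str:
    """Build a weekly archive name from the ISO year and week (hypothetical scheme)."""
    year, week, _ = day.isocalendar()
    return f"{language}-prerelease-{year}-W{week:02d}.tar.gz"


def package_clips(clips_dir: Path, out_dir: Path, language: str, day: date) -> Path:
    """Bundle every .mp3 clip found directly in clips_dir into one weekly archive.

    Assumes clips_dir already contains only the clips that passed validation;
    how that folder gets populated is outside this sketch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / prerelease_name(language, day)
    with tarfile.open(archive, "w:gz") as tar:
        for clip in sorted(clips_dir.glob("*.mp3")):
            # Store clips under their bare file name, without the local path.
            tar.add(clip, arcname=clip.name)
    return archive
```

A weekly cron job could call `package_clips` once per language, and the "stable" releases would then be built from whichever weekly archive has been through the manual cleanup.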

We had a pre-release last time. The community helped clean up some of the data, e.g. Welsh and German. However, the vast majority of the work was still on our shoulders.

Today we released a new version of the dataset, and we keep improving the automation of the process.