Up-to-date dataset download

monitoringe · November 30, 2021, 9:48am

Hi all! How can I get the current dataset for the Uzbek language before the second official release this year?

Why I need it:

It looks like the dataset is published on the website around December 10-15 every year.
The Uzbek voice data has grown significantly since the previous release (from 0.7 hours to 53 hours).
On December 15, a hackathon using the Common Voice dataset will take place in our country.

To prepare for the hackathon, we need the current state of the dataset so we can check whether it is enough to train a usable model with DeepSpeech.

Thanks!

stergro · November 30, 2021, 6:24pm

Unfortunately, that timing is not right. The dataset gets extracted every year about this time, but normally they need two weeks of work after that before they publish it. So you can expect the dataset in the first week of January. Creating the datasets is not trivial and takes more work than just running an export script. I don’t know the details, though; this has been an issue since the beginning of the project.

I am afraid 53 hours won’t be very useful yet. You can expect a model with a word error rate around 50% if you are lucky. I still encourage you to create a model because the experience will help you later when you have a bigger dataset. Maybe cross training will also improve the results a bit.
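For anyone unsure what a word error rate of around 50% means in practice, here is a minimal sketch of how WER is usually computed: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` is made up for this illustration and is not part of DeepSpeech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one missing word out of four reference words.
print(word_error_rate("a b c d", "a x c"))
```

At a WER of 0.5, roughly every second word of the reference needs a correction, which is why such a model is more of a learning exercise than a usable product.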

You could create a Scorer in the hackathon if the date is fixed. A good Scorer improves the result of the model a lot, and it can be created without an audio dataset.
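Since the Scorer is built from text rather than audio, one task that can start right away is collecting and cleaning a text corpus. The following is only a sketch of that preparation step, under the assumption that the input is plain sentences, one per line. The helper names `normalize_line` and `build_corpus` and the set of apostrophe characters kept are illustrative choices, not part of DeepSpeech, and building the actual Scorer from the cleaned text is a separate step done with the DeepSpeech tooling.

```python
import unicodedata

# Assumption: apostrophe-like characters are meaningful in the target
# orthography and should survive cleaning. Adjust for your own alphabet.
APOSTROPHES = "'’ʻʼ‘`"


def normalize_line(line: str) -> str:
    """Lowercase a sentence and replace everything except letters and apostrophes with spaces."""
    text = unicodedata.normalize("NFC", line).lower()
    kept = [ch if (ch.isalpha() or ch in APOSTROPHES) else " " for ch in text]
    # Collapse the runs of spaces left behind by removed punctuation and digits.
    return " ".join("".join(kept).split())


def build_corpus(lines):
    """Return normalized, non-empty sentences with duplicates removed, in first-seen order."""
    seen = set()
    corpus = []
    for line in lines:
        sentence = normalize_line(line)
        if sentence and sentence not in seen:
            seen.add(sentence)
            corpus.append(sentence)
    return corpus


raw = ["Salom, dunyo!", "salom dunyo", "   ", "Yaxshi kun."]
for sentence in build_corpus(raw):
    print(sentence)
```

The cleaned text should use the same character set as the transcripts the acoustic model is trained on, otherwise the Scorer will favor words the model can never output.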
