Up-to-date dataset download

Hi all! How can I get the current dataset for the Uzbek language before this year's second official release?

Why I need it:

  • it looks like the dataset is published on the website around December 10-15 every year
  • the Uzbek voice data has grown significantly since the previous release (from 0.7 hours to 53 hours)
  • on December 15 there will be a hackathon in our country using the Common Voice dataset

To prepare for the hackathon, we need the current state of the dataset so we can check whether it is enough to train a valid and useful model with DeepSpeech.


Unfortunately, that publishing date is not right. The dataset is extracted around this time every year, but it normally takes about two more weeks of work before it is published, so you can expect the dataset in the first week of January. Creating the datasets is not trivial and takes more than just running an export script. I don't know the details, though; this has been an issue since the beginning of the project.

I am afraid 53 hours won't be very useful yet. If you are lucky, you can expect a model with a word error rate of around 50%. I still encourage you to train a model, because the experience will help you later when you have a bigger dataset. Cross training might also improve the results a bit.
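To make the 50% figure concrete: word error rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Here is a minimal sketch in plain Python; the function name `word_error_rate` is just something I made up for illustration, it is not part of DeepSpeech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one missing word out of four reference words.
print(word_error_rate("a b c d", "a x c"))
```

So a WER of 50% means that roughly every second word needs to be corrected, which is why such a model is mostly useful for learning the training process.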

You could create a Scorer in the hackathon if the date is fixed. A good Scorer improves the model's results a lot, and it can be created without an audio dataset.
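Since the Scorer is built from text only, the first step could be cleaning a text corpus into one normalized sentence per line. A rough sketch of that step, assuming Latin-script Uzbek text; the `ALLOWED` character set and both function names are my own assumptions and should be adjusted to match the alphabet your acoustic model uses.

```python
import re

# Assumption: lowercase Latin letters, the apostrophe variants used in
# Uzbek Latin orthography (', U+02BB, U+02BC), and the space character.
ALLOWED = "abcdefghijklmnopqrstuvwxyz'\u02bb\u02bc "


def normalize_sentence(text: str, allowed: str = ALLOWED) -> str:
    """Lowercase, replace characters outside `allowed` with spaces, collapse whitespace."""
    text = text.lower()
    text = "".join(ch if ch in allowed else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(lines):
    """Return normalized, non-empty sentences with duplicates removed, in input order."""
    seen = set()
    corpus = []
    for line in lines:
        sentence = normalize_sentence(line)
        if sentence and sentence not in seen:
            seen.add(sentence)
            corpus.append(sentence)
    return corpus


raw = ["Salom, dunyo!", "salom dunyo", "   ", "O\u02bbzbekiston 2020"]
for sentence in build_corpus(raw):
    print(sentence)
```

The cleaned lines would then go into the language model tooling that the DeepSpeech documentation describes for building a Scorer. Note that this sketch simply drops digits; for a real corpus you would probably want to spell numbers out instead.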