Dataset downloads Dutch

Hi,

To start: thanks for the really great project!

Unfortunately, the downloadable datasets don’t have a (visible) publication date.
At the moment I’m wondering how up to date the Dutch dataset is because the download page gives the following stats:
Size 382 MB
Validated Hr. Total 12
Overall Hr. Total 13
Number of Voices 373

However, when I download this dataset I only end up with 366 MB.

Apart from that the graph on https://voice.mozilla.org/en shows:
Dutch
Hours Recorded 23h
Hours Validated 18h

That is a significant difference: 18 validated hours on the site versus 12 in the download, so a third of the validated speech is new!

So I’m wondering:

  • if the current link for the downloadable dataset is actually correct (since there is a size difference: 382 MB versus 366 MB)
  • when the downloadable dataset will be updated (is there a regular interval?)
  • if it would be wise to add a publication date to the dataset stats.

Regards,

Sander

Hi and welcome to the Common Voice community!

See this message for clarification

For the other two, I’m pinging other team members to check, thanks for reporting.

Cheers.

Thanks for the pointer!
I thought it would be an easy feat, but it clearly is not.
I do still have some questions after reading that entry:

  • Are the datasets for all languages revisited at the same time, or independently?
  • Is there a way to help with the Dutch one?

We haven’t agreed on a plan yet. I’ll be working on a proposal to deliver to the team so we can release datasets sooner, based on what is most helpful for the community. I’ll open a topic about it soon.

Today we released a new version of the dataset, and we keep improving the automation of the process.