Updates to dataset download options

Hi everyone,

Just a quick note to let you all know that Release 38 is now in production, and with it there are two changes to the dataset download page that I want to highlight.

  1. Every multilingual dataset we’ve released since the beginning of 2019 is now available for download. We know the community has been requesting access to old dataset versions for a long time, so hopefully this change will allow you all to do more robust training and comparisons against historical data.

  2. Dataset files must now be downloaded directly from the dataset download page. As Common Voice has grown in popularity, the proportion of direct-link accesses bypassing our site has increased significantly. This both raised our hosting costs and made it impossible for us to know who is using our data, which exposes us to data regulation compliance risk and makes it difficult for us to better understand our community.

    If you were previously downloading datasets directly from the Common Voice website, this change will not affect you at all. With the new system, each dataset download link will be valid for 12 hours and then expire, and you will need to generate a new link from the datasets page. If you experience any technical issues with this, please let me know.
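
For anyone scripting around downloads, the 12-hour window can be tracked with a small helper. This is only a minimal sketch and not part of the Common Voice site: the function name `link_is_expired` and the idea of recording the time at which you generated a link are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Validity window stated in the announcement: links last 12 hours.
LINK_LIFETIME = timedelta(hours=12)


def link_is_expired(generated_at, now=None):
    """Return True if a link generated at `generated_at` is past its window.

    `generated_at` is a timezone-aware datetime that you recorded yourself
    when you requested the link (hypothetical bookkeeping, not something
    the site provides).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - generated_at >= LINK_LIFETIME
```

If this returns True, generate a fresh link from the datasets page rather than retrying the old one.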

And finally, I wanted to reassure everyone that we are currently working on the 2020 H2 dataset release, and we expect that to be ready in mid-December.

Thanks for all your continued support and effort for the project, even during maintenance mode. We all appreciate it immensely.

Hey! I am trying to download datasets through the webpage but I reach an XML page saying “access denied”. Any clues as to why?

Just a question: since this release is planned, will there be a campaign to invite people to contribute?
I don’t think so, because time is not on our side to plan everything, but just in case.

So I believe we should focus on validation during the next couple of weeks?
What’s the precise deadline?

Hi! I’m experiencing the same issue Eliacus mentioned. AWS returns an access denied code when I try to download any dataset.

Sorry about that, folks. Dataset downloads should be back. Let me know if you experience any other issues.

No, there are no plans for campaigns or inviting people to contribute. We no longer have the community management resources that would’ve allowed us to plan for that. The cutoff date for clips will be 23:59 UTC on Dec 11th.
