Updates to dataset download options

Hi everyone,

Just a quick note to let you all know that Release 38 is now on production, and with it there are two changes to the dataset download page that I want to highlight.

  1. Every multilingual dataset we’ve released since the beginning of 2019 is now available for download. We know the community has been requesting access to old dataset versions for a long time, so hopefully this change will allow you all to do more robust training and comparisons against historical data.

  2. Dataset files must now be downloaded directly from the dataset download page. As Common Voice has grown in popularity, the proportion of downloads made through direct links that bypass our site has risen significantly. This has both raised our hosting costs significantly and made it impossible for us to know who is using our data, which exposes us to data regulation compliance risk and makes it difficult for us to better understand our community.

    If you were previously downloading datasets directly from the Common Voice website, this change will not affect you at all. With the new system, each dataset download link will be valid for 12 hours and then expire, and you will need to generate a new link from the datasets page. If you experience any technical issues with this, please let me know.
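For anyone scripting around the 12-hour window described above, here is a minimal sketch of an expiry check. The function name `link_is_valid` is hypothetical, and treating the link as expired at exactly 12 hours is an assumption; the site itself decides when a link actually stops working.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Lifetime stated in the announcement; the exact boundary behaviour is assumed.
LINK_LIFETIME = timedelta(hours=12)


def link_is_valid(generated_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True while a link generated at `generated_at` is inside its window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - generated_at < LINK_LIFETIME
```

If the check fails, the only remedy described here is to generate a new link from the datasets page.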

And finally, I wanted to reassure everyone that we are currently working on the 2020 H2 dataset release, and we expect that to be ready in mid-December.

Thanks for all your continued support and effort for the project, even during maintenance mode. We all appreciate it immensely.

Hey! I am trying to download datasets through the webpage, but I get an XML page saying "Access Denied". Any clues as to why?

Just a question: since this release is planned, will there be a campaign to invite people to contribute?
I don’t think so, because there isn’t enough time to plan everything, but I’m asking just in case.

So I believe we should focus on validation during the next couple of weeks?
What’s the precise deadline?

Hi! I’m experiencing the same issue Eliacus mentioned. AWS returns an access denied error when I try to download any dataset.

Sorry about that folks, dataset downloads should be back. Let me know if you experience any other issues.

No, there are no plans for campaigns or for inviting people to contribute; we no longer have the community management resources that would’ve allowed us to plan for that. The cutoff date for clips will be 23:59 UTC on Dec 11th.

Hi,
I am still having trouble downloading the German Common Voice dataset.
If I download directly from the browser, it works. But I want to download the dataset to the server we are working on, and the download does not work there.
Please see the attached error screenshot.
My guess is that both wget and curl have a URL length limit and this URL exceeds it, so wget and curl truncate the URL and the download fails.
Any opinions on how to solve this problem?
Any help or guidance will be much appreciated.

Try using -O filename in wget; it could be that the filename is too long rather than the URL.
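To illustrate the suggestion above, here is a small sketch that derives a short local filename from a long download link, so it can be passed to -O. The function name `local_filename` and the example link in the usage note are hypothetical. It assumes the link ends in the archive name followed by a query string.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_filename(url: str) -> str:
    """Return the last path component of `url`, ignoring any query string."""
    path = urlsplit(url).path  # drops everything from "?" onwards
    return PurePosixPath(path).name
```

For a made-up link such as https://example.com/cv-corpus/de.tar.gz?Signature=abc&Expires=123, this gives de.tar.gz. When pasting a link like that into a shell, wrapping it in single quotes keeps characters such as & from being interpreted by the shell.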

Thanks Francis. I tried that and got a 403 Forbidden error. Actually, yesterday I googled this issue, and someone on Common Voice had said that direct downloads were deliberately disabled for economic reasons.
So, in the end, I had to upload that large file manually.
Thanks anyway.

Regards

Actually, I want to add this post to correct my wrong diagnosis at the beginning. When I posted my first message, I thought the problem had something to do with a long or invalid URL. However, it turned out to have nothing to do with the URL itself; it is a deliberately imposed restriction.

Just copy the link up to the .tar.gz at the end of the first line (that part of the link is more than enough), then download with wget using the copied link. For example:
wget https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-12.0-2022-12-07/cv-corpus-12.0-2022-12-07-en.tar.gz
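The trimming described above can be sketched in a few lines. The function name `trim_to_archive` is hypothetical, and the sketch assumes everything after .tar.gz is a query string. It only shortens the link; whether the server accepts the shortened link is not something the code can confirm, and earlier posts in this thread report 403 errors for direct downloads.

```python
from urllib.parse import urlsplit, urlunsplit


def trim_to_archive(url: str) -> str:
    """Return `url` with its query string and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, host and path; blank out the query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```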