Hi everyone,
Just a quick note to let you all know that Release 38 is now in production, and with it come two changes to the dataset download page that I want to highlight.
- Every multilingual dataset we’ve ever released since the beginning of 2019 is now available for download. We know having access to old dataset versions is something the community has been requesting for a long time, so hopefully this change will allow you all to do more robust training and comparisons against historical data.
- Dataset files must now be downloaded directly from the dataset download page. As Common Voice has grown in popularity, the proportion of downloads made through direct links that bypass our site has risen significantly. This has both increased our hosting costs substantially and made it impossible for us to know who is using our data, which exposes us to data regulation compliance risk and makes it difficult for us to understand our community better.
If you were previously downloading datasets directly from the Common Voice website, this change will not affect you at all. Under the new system, each dataset download link is valid for 12 hours; once it expires, you will need to generate a new link from the datasets page. If you experience any technical issues with this, please let me know.
And finally, I wanted to reassure everyone that we are currently working on the 2020 H2 dataset release, and we expect that to be ready in mid-December.
Thanks for all your continued support and effort for the project, even during maintenance mode. We all appreciate it immensely.