Delta releases

Hello everyone, I’d like to announce that as of the next release of Common Voice in October, we will be providing delta release downloads in addition to the full release dataset downloads.

What is a delta release? Let’s imagine you had downloaded Common Voice 10 for Catalan, a total download size of 49 GB. Now Common Voice 11 is released and you want to get all that amazing new data. A delta release allows you to download only the new data, that is, the difference between the Common Voice 10 release and the Common Voice 11 release.

Why are you doing this? We had a lot of feedback from the community that the large download sizes were a problem for many users and developers wanting to use the dataset. The problem is not only the time a large download takes on a slow connection, but also that a large download is difficult to complete on an unstable connection.

Will you be offering deltas between all the releases? No, we will be offering delta releases between the previous release and the latest release on a rolling basis. So at some point after the next release (Common Voice 11 in October) we will make available the delta between 10 and 11. At the release following that (Common Voice 12) we will make available the delta between 11 and 12.

When will the delta releases be available? The first delta release will be coming out in mid to late October. We’ll keep you posted!

What does the data look like? It is in exactly the same format as the full release, except that the TSV files and the clips/ directory contain only the new data, not all the data.

Will it be split in the same way? The delta releases will not be split. Depending on your use case, you can choose to add all the data to your training set, or you can merge with the previous version and use CorporaCreator to create a version that will be identical to the full Common Voice 11 release.
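As a rough illustration of the "merge with the previous version" step, here is a minimal Python sketch that combines one TSV file from the previous release with the matching TSV file from the delta. It is not an official tool. It assumes both files share the same header and that a `path` column identifies each clip; the function names `read_rows` and `merge_tsv` are made up for this example.

```python
def read_rows(tsv_path, key):
    """Return (header line, {key value: raw line}) for one TSV file."""
    # newline="\n" keeps any unusual line-break characters inside sentences intact.
    with open(tsv_path, encoding="utf-8", newline="\n") as handle:
        header = handle.readline().rstrip("\r\n")
        column = header.split("\t").index(key)
        rows = {}
        for line in handle:
            line = line.rstrip("\r\n")
            if line:  # skip blank lines
                rows[line.split("\t")[column]] = line
    return header, rows


def merge_tsv(previous_tsv, delta_tsv, merged_tsv, key="path"):
    """Write previous rows plus delta rows, keeping one row per clip.

    Assumption: `key` names a column that is unique per clip.
    Returns the number of data rows written.
    """
    header, rows = read_rows(previous_tsv, key)
    delta_header, delta_rows = read_rows(delta_tsv, key)
    if delta_header != header:
        raise ValueError("column layout differs between the two files")
    # If a clip appears in both files, the row from the delta wins.
    rows.update(delta_rows)
    with open(merged_tsv, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(header + "\n")
        for line in rows.values():
            handle.write(line + "\n")
    return len(rows)
```

You would call it once per TSV file, for example `merge_tsv("cv-10/ca/validated.tsv", "cv-11-delta/ca/validated.tsv", "merged/ca/validated.tsv")`, where all three paths are placeholders for wherever you unpacked the archives. The new audio files from the delta's clips/ directory would also need to be copied next to the existing clips before running CorporaCreator on the merged data.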

Can we still access older versions of the dataset? Yes, but the way you access them will change. You will be able to download the latest dataset, the one before it, and the delta of clips between the two. We will no longer support downloading the oldest versions of the Common Voice datasets directly from the platform; however, you can always email us to access an older version of the dataset. This is partly for cost reasons, but also part of our commitment to be thoughtful about the environmental impact of our platform.

Please feel free to ask any questions you may have about the delta releases on Discourse or Matrix, and the team will endeavour to answer them!

Can people still request deletion of their data from the datasets?
Manually adding delta data to an earlier release would not delete those clips.

Yes, they can still request that.
