Versioning the datasets

sharvil · March 22, 2020, 6:51pm

Would it be possible to version the published datasets? There’s no identifying information within the dataset itself, which makes it very hard to know what data I’m actually looking at.

Ideally, there would be a top-level file called VERSION with language, publishing date, and version number included.
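To make the idea concrete, here is a minimal sketch of writing and reading such a file. The `key=value` layout, the field names, and the helper names `write_version_file` / `read_version_file` are all assumptions for illustration; nothing like this exists in the published archives.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical contents of a top-level VERSION file."""
    language: str   # e.g. "es"
    published: str  # publishing date as an ISO string, e.g. "2019-12-10"
    version: str    # release identifier; the scheme is an assumption


def write_version_file(root: Path, info: DatasetVersion) -> Path:
    """Write a VERSION file into the dataset's top-level directory."""
    path = root / "VERSION"
    path.write_text(
        f"language={info.language}\n"
        f"published={info.published}\n"
        f"version={info.version}\n",
        encoding="utf-8",
    )
    return path


def read_version_file(root: Path) -> DatasetVersion:
    """Read the VERSION file back from an extracted dataset directory."""
    fields = {}
    for line in (root / "VERSION").read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return DatasetVersion(fields["language"], fields["published"], fields["version"])
```

Because the file would live inside the extracted directory, the version would survive even if the archive is renamed or deleted.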

nukeador · March 24, 2020, 11:58am

Currently, the file you download is named using the following format:

languagecode_hours_date

For example:

es_221h_2019-12-10

Are you looking for something like that inside the main folder as well?
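For reference, a name in the `languagecode_hours_date` format can be parsed mechanically. The sketch below is only an illustration: the function name `parse_archive_name`, the regular expression, and the assumption that language codes consist of letters and hyphens are mine, not part of any official tooling.

```python
import re
from datetime import date
from typing import NamedTuple


class ArchiveName(NamedTuple):
    language: str
    hours: int
    released: date


# Assumed pattern: <language code>_<hours>h_<YYYY-MM-DD>
_PATTERN = re.compile(
    r"^(?P<lang>[A-Za-z-]+)_(?P<hours>\d+)h_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_archive_name(name: str) -> ArchiveName:
    """Parse a name such as 'es_221h_2019-12-10' (any extension is ignored)."""
    stem = name.split(".", 1)[0]  # drop a trailing extension like .tar.gz
    match = _PATTERN.match(stem)
    if match is None:
        raise ValueError(f"name does not follow languagecode_hours_date: {name!r}")
    return ArchiveName(
        language=match["lang"],
        hours=int(match["hours"]),
        released=date.fromisoformat(match["date"]),
    )
```

With this sketch, `parse_archive_name("es_221h_2019-12-10")` yields the language, hour count, and date, while a bare name such as `en.tar.gz` raises `ValueError` because it carries no version information at all.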

sharvil · March 24, 2020, 9:01pm

Yes, I’m looking for a file inside the archive since file names can get lost/changed/munged for a variety of reasons. It’s probably standard practice to download the archive, extract it, and delete the tarball so you’re not storing redundant data. And at that point you’ve lost the version number.

Also, maybe I’m missing something but the file download for English is just called en.tar.gz. The current link is https://[redacted].s3.amazonaws.com/cv-corpus-4-2019-12-10/en.tar.gz so if you were to download it, you’d just get en.tar.gz.

sharvil · March 30, 2020, 6:01pm

bump thread

Any chance we can have version strings included in the dataset?

nukeador · March 30, 2020, 6:30pm

Hi @sharvil

I’ve pinged our dev team about this topic to be able to provide a better answer.

In general, improvement requests are first discussed here, and if there is an agreement with the team, it is prioritized and incorporated into the future dev roadmap.

mbranson · March 31, 2020, 10:01pm

Thanks for looping me in, @nukeador, and for the question, @sharvil. Versioning as part of the dataset has been identified as a requirement for improving the quality and accessibility of the Common Voice dataset. Timing and scope for implementation are not finalized, though; in an ideal world this would happen in the second half of 2020. Thanks for your patience and input!
