Versioning the datasets

Would it be possible to version the published datasets? There’s no identifying information within the dataset itself, which makes it very hard to know what data I’m actually looking at.

Ideally, there would be a top-level file called VERSION with language, publishing date, and version number included.
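To make the request concrete, here is a minimal sketch of what writing and reading such a file could look like. The `key=value` layout, the field names, and the helper names `write_version_file` and `read_version_file` are my own assumptions for illustration, not an existing Common Voice format.

```python
from pathlib import Path


def write_version_file(root, language, published, version):
    """Write a plain-text VERSION file at the top level of a dataset folder.

    The key=value layout is a hypothetical format, not an official one.
    """
    lines = [
        f"language={language}",
        f"published={published}",
        f"version={version}",
    ]
    path = Path(root) / "VERSION"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def read_version_file(root):
    """Read the VERSION file back into a dict of strings."""
    fields = {}
    text = (Path(root) / "VERSION").read_text(encoding="utf-8")
    for line in text.splitlines():
        # Skip blank or malformed lines instead of failing on them.
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields
```

With something like this in place, `read_version_file("path/to/extracted/dataset")` would still identify the release after the tarball has been deleted or renamed.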

Currently, the name of the file you download has the following format:

languagecode_hours_date

For example:

es_221h_2019-12-10

Are you looking for something like that also inside the main folder?
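For reference, a name in the languagecode_hours_date format can be split back into its parts with a short script. This is only a sketch: the regular expression and the function name `parse_download_name` are assumptions based on the single example above, and names that do not follow the pattern simply return `None`.

```python
import re
from datetime import date

# Assumed pattern: <language code>_<whole hours>h_<YYYY-MM-DD>
NAME_PATTERN = re.compile(
    r"^(?P<language>[A-Za-z-]+)_(?P<hours>\d+)h_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_download_name(name):
    """Split a download name into language, hours, and date.

    Returns None when the name does not match the assumed pattern
    or the date part is not a valid calendar date.
    """
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    try:
        published = date.fromisoformat(match.group("date"))
    except ValueError:
        return None
    return {
        "language": match.group("language"),
        "hours": int(match.group("hours")),
        "date": published,
    }
```

Calling `parse_download_name("es_221h_2019-12-10")` yields the language code `es`, 221 hours, and the date 2019-12-10, while a name that carries none of this information gives `None`.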

Yes, I’m looking for a file inside the archive since file names can get lost/changed/munged for a variety of reasons. It’s probably standard practice to download the archive, extract it, and delete the tarball so you’re not storing redundant data. And at that point you’ve lost the version number.

Also, maybe I’m missing something but the file download for English is just called en.tar.gz. The current link is https://[redacted].s3.amazonaws.com/cv-corpus-4-2019-12-10/en.tar.gz so if you were to download it, you’d just get en.tar.gz.

bump thread

Any chance we can have version strings included in the dataset?

Hi @sharvil

I’ve pinged our dev team about this topic to be able to provide a better answer.

In general, improvement requests are first discussed here, and if there is agreement with the team, they are prioritized and incorporated into the future dev roadmap.

Thanks for looping me in here @nukeador, and for the question @sharvil. Versioning as part of the dataset has been identified as a requirement for improving the quality of, and access to, the Common Voice dataset. Timing and scope for implementation are not finalized, though; in an ideal world this would happen in the second half of 2020. Thanks for your patience and input!