V6.1 is masquerading as a tar file when it is actually a tar.gz file

When you download the Common Voice Corpus 6.1 dataset, the file name convention is 〈language〉.tar. However, it really isn’t a tar file but a gzip file (as ht file extention suggest). I have not checked all languages and versions permutations, but versions 6.1 and earlier seem to have .tar as the file extension while the file is actually a gzip file. This is misleading (and actually a little annoying). I think it should be .tar.gz, as in newer versions of Common Voice Corpus (e.g. Common Voice Corpus 7.0, Common Voice Corpus 8.0, Common Voice Corpus 9.0).

Yes, unfortunately, the file naming is different across languages and across versions. Before v5.1 they were different; with that version it became regular for Turkish: cv-corpus-6.1-2020-12-11-tr.tar.gz

Check old German versions for example. They are all de.*

It becomes a problem if you are working on the timeline (like I’m doing) through scripting.
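
For scripting across versions, one workaround is to ignore the extension and look at the file contents instead. The following is only a minimal Python sketch, not anything Common Voice provides: the helper names `is_gzip` and `list_members` are made up for illustration, and it assumes the download is either a plain tar or a gzip-compressed tar. It checks the two gzip magic bytes and then lets the standard `tarfile` module detect the compression itself.

```python
import tarfile

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes,
    regardless of what its extension claims."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def list_members(path, limit=5):
    """Return up to `limit` member names from a tar archive.

    Mode "r:*" asks tarfile to detect the compression transparently,
    so the same call works for a plain .tar and for a .tar.gz that
    was saved with a .tar extension.
    """
    names = []
    with tarfile.open(path, mode="r:*") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

With a downloaded file called, say, `tr.tar` (a hypothetical name), `is_gzip("tr.tar")` tells you whether it is really gzip-compressed, and `list_members("tr.tar")` reads it either way, so the script does not need a per-version table of extensions.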


@bozden Thanks for the comment. I have not checked the old German datasets, but I take your word for it. I understand and sympathise with your problem. Consistency is very important for any database … that and good, intuitive documentation, so it is really unfortunate. Maybe that is why I do not see a lot of research papers that use CommonVoice. However, I guess beggars (those who desperately need data) cannot be choosers.

Lastly, I apologise for my original comment above, which had an error. I cannot edit it now (perhaps because there is your reply, which I am grateful for), but here is what I meant to say:

  • [Error] it really isn’t a tar file but a gzip file (as ht file extention suggest)
    [Fix] it really isn’t a tar file (as the file extension suggests) but a gzip file
  • [Error] I have not checked all languages and versions permutations
    [Fix] I have not checked all permutations (all language and version patterns)

You should open an issue on GitHub if you cannot reach the data. I don’t know the language, so I cannot check.