V6.1 is masquerading as a tar file when it is actually a tar.gzip file

makoto_wada_jp · June 20, 2022, 12:36am

When you download the Common Voice Corpus 6.1 dataset, the file name convention is 〈language〉.tar. However, it really isn’t a tar file but a gzip file (as the file extension suggests). I have not checked all languages and versions permutations, but versions 6.1 and earlier seem to have .tar as the file extension while in actuality the file is a gzip file. This is misleading (and actually a little annoying). I think it should be .tar.gz, as in newer versions of Common Voice Corpus (e.g. Common Voice Corpus 7.0, Common Voice Corpus 8.0, Common Voice Corpus 9.0).
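One way to confirm this without trusting the extension is to look at the first two bytes of the file: every gzip stream starts with `1f 8b`. The sketch below is only an illustration using Python's standard library; the helper names `is_gzip` and `open_archive` are made up for this example, and it assumes the download is a gzip-compressed tar archive as described above.

```python
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def open_archive(path):
    """Open a tar archive whatever its extension says.

    Mode "r:*" lets tarfile detect the compression from the file
    content, so a gzip-compressed archive named "xx.tar" still opens.
    """
    return tarfile.open(path, mode="r:*")
```

With this, `is_gzip("ja.tar")` tells you whether the download is really gzip-compressed (the file name here is just a placeholder), and `open_archive("ja.tar")` reads it either way.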

bozden · June 20, 2022, 5:49pm

Yes, unfortunately, the file naming is different across languages and across versions. Before v5.1 the names were different; with that version the naming became regular for Turkish: cv-corpus-6.1-2020-12-11-tr.tar.gz

Check old German versions for example. They are all de.*

It becomes a problem if you are working on the timeline (like I’m doing) through scripting.
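For scripting across versions, one possible workaround is to normalise the names after download based on the file content rather than on what the server called it. This is only a sketch: `corrected_name` is a hypothetical helper, and it only handles the one case discussed in this thread (a gzip-compressed file whose name ends in `.tar`).

```python
from pathlib import Path


def corrected_name(path):
    """Suggest a name that matches the file content.

    If the file is gzip-compressed but its name ends in ".tar",
    return the same path with ".gz" appended; otherwise return
    the path unchanged. Nothing is renamed on disk.
    """
    path = Path(path)
    with path.open("rb") as f:
        gzipped = f.read(2) == b"\x1f\x8b"  # gzip magic number
    if gzipped and path.suffix == ".tar":
        return path.with_name(path.name + ".gz")
    return path
```

A script could then call `Path(p).rename(corrected_name(p))` on each download before processing it.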

makoto_wada_jp · June 21, 2022, 11:01am

@bozden Thanks for the comment. I have not checked old German databases but I take your word for it. I understand and sympathise with your problem. Consistency is very important for any database … that and good intuitive documentation, so it is really unfortunate. Maybe that is why I do not see a lot of research papers that use CommonVoice. However, I guess beggars (those who desperately need data) cannot be choosers.

Lastly, I apologise for my original comment above, which had an error. I cannot edit it now (perhaps because there is your reply, which I am grateful for) but here is what I meant to say:

[Error] it really isn’t a tar file but a gzip file (as the file extension suggests)
[Fix] it really isn’t a tar file (as the file extension suggests) but a gzip file
[Error] I have not checked all languages and versions permutations
[Fix] I have not checked all permutations (all language and version patterns)

bozden · June 21, 2022, 2:38pm

You should open an issue on GitHub if you cannot reach the data. I don’t know the language, so I cannot check.
