Common Voice by Mozilla

Abraham_Kakooza · June 16, 2021, 5:08pm

When I download the Luganda dataset, the zipped .tar file shows up as an unknown file format.
According to the documentation at GitHub - common-voice/cv-dataset: Metadata and versioning details for the Common Voice dataset, I should get a .tar.gz file.
Please help. I am trying to get the text corpus for the Luganda language from the dataset.

ftyers · June 16, 2021, 6:08pm

Dear @Abraham_Kakooza. The file is a tar.gz file, so you can extract it using tar -xzf. If this doesn’t work, it may already have been extracted; use file to find out what file type it reports. I was able to extract the Luganda data with no problem. Feel free to join us on the Common Voice Matrix channel for real-time questions and answers.
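
For anyone who would rather check this from Python than from the shell, here is a minimal sketch of the same two steps: look at the leading bytes to see what the file really is (roughly what file does), then extract it without trusting the extension. The function names sniff_format and extract_archive are made up for this example and are not part of any Common Voice tooling. Only the standard-library tarfile module is used.

```python
import tarfile
from pathlib import Path


def sniff_format(path):
    """Guess the archive type from the file's content, not its name."""
    with open(path, "rb") as fh:
        head = fh.read(2)
    if head == b"\x1f\x8b":
        # 0x1f 0x8b is the gzip magic number, so this is gzip-compressed
        # (a .tar.gz if the payload is a tar archive).
        return "gzip"
    if tarfile.is_tarfile(path):
        # Readable by tarfile and not gzip: a plain tar, or a tar with
        # another compression that tarfile supports.
        return "tar"
    return "unknown"


def extract_archive(path, dest):
    """Extract a .tar or .tar.gz archive regardless of its extension."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" asks tarfile to detect the compression by itself.
    with tarfile.open(path, mode="r:*") as archive:
        archive.extractall(dest)
        return archive.getnames()
```

Calling sniff_format on the downloaded file and then extract_archive with an output folder of your choice should give the same result as tar -xzf. Only extract archives from a source you trust, since extraction writes whatever paths the archive contains.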

Abraham_Kakooza · June 19, 2021, 5:12am

Thanks a lot @ftyers for this. I am going to try it out and let you know if I am successful. Glad to meet you.

Abraham_Kakooza · June 19, 2021, 5:40am

Thanks @ftyers, it worked. This was quite helpful.
