Differences between data from the Hugging Face dataset and the downloaded dataset?

greenw0lf · May 27, 2023, 1:52pm

Hi everyone,

I was wondering if there are differences in the dataset splits (dev, train, test) between the Hugging Face version of the dataset and the one you can download from https://commonvoice.mozilla.org/. I did some checks myself, as I am using the Frisian subset for experiments for my Master's thesis, but I am not sure my results are correct because I am using data from both a downloaded version of the dataset and the Hugging Face one.

Can someone else confirm if the datasets are the same between the 2 different sources? Thanks in advance.

bozden · May 27, 2023, 11:57pm

Hey @greenw0lf, isn’t it enough just to diff the split files as you already have both versions?
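
For anyone who wants to script that diff, here is a minimal sketch that hashes each split file in two folders and reports whether the copies are byte-identical. The folder arguments and the file names `train.tsv`, `dev.tsv` and `test.tsv` are assumptions about how the two copies are laid out on disk; adjust them to match your own download.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_split_files(dir_a, dir_b, names=("train.tsv", "dev.tsv", "test.tsv")):
    """Map each split file name to True if both copies are byte-identical.

    dir_a and dir_b are hypothetical folders holding the two versions;
    the default file names are an assumption about the split layout.
    """
    return {
        name: sha256_of(Path(dir_a) / name) == sha256_of(Path(dir_b) / name)
        for name in names
    }
```

A `False` for a split only says the bytes differ. The cause could be as small as different line endings, so a row-level comparison is the natural next step in that case.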

jesslynnrose · May 30, 2023, 1:10pm

Hello! While comparing the two datasets might be the best way to answer this question immediately, the Hugging Face team might be the best source of information on where and how their re-hosted datasets are likely to diverge from the canonical Common Voice datasets.

I’m so sorry that the MCV team and I don’t have any insight into how rehosted versions of the datasets might differ.

bozden · May 30, 2023, 10:17pm

I just downloaded the split files from the following link and compared them to the ones I have; they are identical. I also confirmed that for Turkish. I think it will be the same for all locales, as it is an automated process.

greenw0lf · May 31, 2023, 7:04am

@jesslynnrose Thank you for your answer. Sad to hear that there’s no official way, so to say, to confirm whether the rehosted versions are the same.

@bozden Thank you very much for investigating the dataset as well. Maybe I should have clarified that I also checked myself on Common Voice 8 if the two sources have identical information, but I did so using pandas DataFrames and I wasn’t sure if the answer is fully correct. Nevertheless, I wanted a sanity check and I am happy to see that, at least for CV 13, the 2 sources have the same files. Cheers!
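
A row-level check of this kind can also be sketched with only the standard library, which avoids any doubt about how DataFrames get compared. The snippet below is a sketch under assumptions, not an official method: it assumes each split is a tab-separated file with a `path` column naming the clip, and it compares only the base file name, in case one source stores a longer path than the other. The function names are made up for illustration.

```python
import csv
import os


def clip_names(tsv_path, column="path"):
    """Collect the clip file names listed in one split file.

    Assumes a tab-separated file with a header row; the column name
    "path" is an assumption about the split file layout.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences from being
        # interpreted as CSV quoting.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {os.path.basename(row[column]) for row in reader}


def split_difference(tsv_a, tsv_b, column="path"):
    """Return (only_in_a, only_in_b); two empty sets mean the same clips."""
    names_a = clip_names(tsv_a, column)
    names_b = clip_names(tsv_b, column)
    return names_a - names_b, names_b - names_a
```

Because it compares sets, the check ignores row order and duplicate rows. It tells you whether both sources assign the same clips to a split, not whether every other column matches.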
