Differences between the Hugging Face dataset and the downloaded dataset?

Hi everyone,

I was wondering if there are differences in the dataset splits (dev, train, test) between the Hugging Face version of the dataset and the one you can download from https://commonvoice.mozilla.org/. I am using the Frisian subset for some experiments for my Master's thesis and did some checks myself, but I am not sure if the results I get are correct because I am using data from both a downloaded version of the dataset and the Hugging Face one.

Can someone else confirm if the datasets are the same between the 2 different sources? Thanks in advance.

Hey @greenw0lf, isn’t it enough just to diff the split files as you already have both versions?
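If a command-line `diff` is not handy, the same check can be sketched in Python by hashing each split file from both sources. This is only a minimal sketch: the file names `train.tsv`, `dev.tsv`, and `test.tsv` and the helper names `file_digest` and `compare_splits` are assumptions for illustration, not something either source guarantees.

```python
import hashlib
from pathlib import Path

# Assumed split file names; adjust to whatever both copies actually contain.
SPLITS = ("train.tsv", "dev.tsv", "test.tsv")


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_splits(dir_a, dir_b, splits=SPLITS):
    """Map each split file name to True if both copies are byte-identical.

    A file that is missing from either directory counts as not identical.
    """
    result = {}
    for name in splits:
        a, b = Path(dir_a) / name, Path(dir_b) / name
        result[name] = (
            a.is_file() and b.is_file() and file_digest(a) == file_digest(b)
        )
    return result
```

Calling `compare_splits("downloaded/fy-NL", "rehosted/fy-NL")` (both paths hypothetical) returns a dict such as `{"train.tsv": True, ...}`. A `False` only says the bytes differ, for example because of line endings or row order, so it does not by itself mean the split membership differs.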


Hello! While comparing the two datasets might be the quickest way to answer this question, asking the Hugging Face team where and how their re-hosted datasets are likely to diverge from the canonical Common Voice datasets might be the best source of information.

I’m sorry that the MCV team and I don’t have any insight into how rehosted versions of the datasets might differ.


I just downloaded the split files from the following link and compared them to the ones I have; they are identical. I also confirmed that for Turkish. I think it will be the same for all locales, as it is an automated process.



@jesslynnrose Thank you for your answer. Sad to hear that there’s no official way, so to speak, to confirm whether the rehosted versions are the same.

@bozden Thank you very much for investigating the dataset as well. Maybe I should have clarified that I also checked myself on Common Voice 8 if the two sources have identical information, but I did so using pandas DataFrames and I wasn’t sure if the answer is fully correct. Nevertheless, I wanted a sanity check and I am happy to see that, at least for CV 13, the 2 sources have the same files. Cheers!
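For anyone who wants a row-level check instead of a byte-level one, here is a rough sketch of the kind of comparison I had in mind, written with the standard `csv` module rather than pandas. It assumes both split files are tab-separated with a header row and share a `path` column that identifies each clip. That column name and the function names `load_rows` and `compare_rows` are my assumptions here, so adjust them to the actual files.

```python
import csv


def load_rows(tsv_path, key="path"):
    """Read a TSV split file into a dict keyed by the given column."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain quote characters that should
        # be kept literally instead of being parsed as CSV quoting.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row[key]: row for row in reader}


def compare_rows(tsv_a, tsv_b, key="path"):
    """Compare two split files regardless of row order.

    Returns (keys only in A, keys only in B, keys whose rows differ).
    """
    a, b = load_rows(tsv_a, key), load_rows(tsv_b, key)
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return only_a, only_b, changed
```

Three empty lists mean the two files contain the same rows, even if the order differs. If the rehosted version uses different column names, the `changed` list will flag every row, so in that case it is safer to compare only the columns both files share.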