Discrepancy in Hours Between Common Voice Datasets Page and Hugging Face Download

I’ve encountered a discrepancy that I’m hoping someone can clarify. On the Common Voice datasets page, it’s specified that the Luganda dataset contains over 467 hours of data. However, when I downloaded the dataset via Hugging Face, the splits I received were significantly smaller: 117 hours for training, 21 hours for testing, and 21 hours for validation, which together add up to only about 159 hours.
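For reference, this is roughly how I measured the hours per split. The dataset name and version below are just an example; use whichever config you actually downloaded (and note you need to be logged in to Hugging Face and have accepted the dataset terms):

```python
from datasets import load_dataset

def split_hours(split: str) -> float:
    # Sum audio durations (samples / sampling rate) over the whole split.
    ds = load_dataset("mozilla-foundation/common_voice_17_0", "lg", split=split)
    seconds = sum(len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"] for ex in ds)
    return seconds / 3600

for split in ("train", "validation", "test"):
    print(split, round(split_hours(split), 1), "hours")
```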

Does anyone know why there is such a large difference between the total hours shown on the Common Voice page and the hours available for download through Hugging Face? Is there additional data that isn’t included in the splits provided by Hugging Face?

Hey @Kasule_John_Trevor, welcome. Let me try to explain the durations:

Here is the total info from v18.0: Total: 560h, Validated: 437h (slightly rounded; the actual figures are 559.230 and 435.398 respectively). The difference consists of (a) invalidated recordings (65.196h) and (b) not-yet-validated recordings, i.e. not enough votes yet (~58.637h).
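If you have the raw corpus download at hand, you can reproduce this bookkeeping yourself. A rough sketch, assuming the release ships a clip_durations.tsv with `clip` and `duration[ms]` columns (check your version, the names may differ), and with a hypothetical extraction path:

```python
import pandas as pd

base = "cv-corpus-18.0-2024-06-14/lg"  # hypothetical extraction path, adjust to yours

# Map every clip file name to its duration in milliseconds.
durations = pd.read_csv(f"{base}/clip_durations.tsv", sep="\t")
ms_by_clip = dict(zip(durations["clip"], durations["duration[ms]"]))

def hours(tsv: str) -> float:
    # Sum the durations of all clips listed in the given TSV and convert ms -> hours.
    df = pd.read_csv(f"{base}/{tsv}", sep="\t")
    return sum(ms_by_clip.get(p, 0) for p in df["path"]) / 3_600_000

validated, invalidated, other = hours("validated.tsv"), hours("invalidated.tsv"), hours("other.tsv")
print(f"validated={validated:.1f}h invalidated={invalidated:.1f}h other={other:.1f}h "
      f"total={validated + invalidated + other:.1f}h")
```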

But the default splits do not use the whole validated data, so (except for an edge case that never happens in practice) validated != train + dev + test. Here is why:

The splits are generated with the CorporaCreator repo, which makes sure voices are distinct across splits and takes only a single recording per sentence. So if a sentence is recorded more than once, only one recording is kept. This is done to avoid transcript bias. Think of it this way: there are now 128 languages, and they all have different characteristics: small or large text corpora, few or many volunteers, a handful of volunteers recording too much, etc. CorporaCreator tries to balance all of these.
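This is not the actual CorporaCreator code, but a minimal pandas illustration of the “one recording per sentence” rule, assuming the usual validated.tsv columns:

```python
import pandas as pd

validated = pd.read_csv("validated.tsv", sep="\t")

# Keep a single recording for each distinct sentence; the rest never make it
# into the default train/dev/test splits.
one_per_sentence = validated.drop_duplicates(subset="sentence", keep="first")

print(f"validated recordings: {len(validated)}")
print(f"usable for default splits (1 per sentence): {len(one_per_sentence)}")
```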

The edge case: unlimited sentences (a large text corpus), no sentence recorded more than once, and many, many users each recording only a few sentences. That would use the whole validated set in training.

This is very limiting, especially for low-resource languages, so we usually don’t use the default splits; we re-split instead. A safe approach is to use CorporaCreator with the -s 5 option (max 5 recordings per sentence), or you can devise another splitting algorithm to use the whole validated set.
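For example, here is a rough sketch of a hand-rolled re-split, with at most 5 recordings per sentence and speakers (client_id) kept disjoint across splits. This is not CorporaCreator’s algorithm, just one possible approach:

```python
import pandas as pd

df = pd.read_csv("validated.tsv", sep="\t")

# Cap repeated sentences at 5 recordings each.
df = df.groupby("sentence").head(5)

# Assign whole speakers to splits (roughly 80/10/10 by recording count),
# so no voice appears in more than one split.
speakers = pd.Series(df["client_id"].unique()).sample(frac=1.0, random_state=42)
counts = df["client_id"].value_counts()
total = counts.sum()

train_ids, dev_ids, test_ids = set(), set(), set()
running = 0
for cid in speakers:
    running += counts[cid]
    if running <= 0.8 * total:
        train_ids.add(cid)
    elif running <= 0.9 * total:
        dev_ids.add(cid)
    else:
        test_ids.add(cid)

for name, ids in [("train", train_ids), ("dev", dev_ids), ("test", test_ids)]:
    df[df["client_id"].isin(ids)].to_csv(f"{name}.tsv", sep="\t", index=False)
```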

Here is the distribution of recordings for Luganda in v18.0:

As you can see, most of the sentences are recorded more than once, so those extra recordings are not included. This usually happens at the very start of a dataset, when only the required minimum of text (5k sentences) has been added and many people record those same sentences, multiple times. In the first version of Luganda (v10.0), 467 people (?) made ~300k recordings. Although there were ~90k sentences, that was not enough.

Which splitting algorithm is right really depends on your model and procedure (e.g. are you fine-tuning Whisper, which is robust to such biases?). You should take validated.tsv and re-split it.

You can see an analysis of your dataset and of several splitting algorithms in the following webapp (only the beta site is updated for v18.0 for now). From there you can also reach the Google Drive link to download pre-made splits. You can simply download them, overwrite train/dev/test, and train with them. I think you will get much better results with the v1 algorithm.

https://cv-dataset-analyzer.netlify.app/
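Once you have the re-split TSVs (downloaded or generated), you can point Hugging Face datasets at them directly. The file paths below are placeholders for wherever you put them:

```python
from datasets import load_dataset

data_files = {
    "train": "lg/train.tsv",
    "validation": "lg/dev.tsv",
    "test": "lg/test.tsv",
}
cv = load_dataset("csv", data_files=data_files, delimiter="\t")
print(cv)

# The "path" column references the mp3 files under clips/ in the extracted
# corpus; cast it to an Audio feature (adjusting paths as needed) before training:
#   from datasets import Audio
#   cv = cv.cast_column("path", Audio(sampling_rate=16_000))
```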


Thank you for the help. That was very helpful.

What is the relationship between the dev and validation splits?

I’m not sure I understood the question correctly, but here you go: in general ML terminology, “dev split” and “validation split” are used interchangeably. E.g. Common Voice’s dev.tsv is the data used during training to validate (with forward passes only) what the network has learned from the train split. Other toolkits may name the same file “validation.tsv”; it serves the same purpose.

Do not confuse it with validated.tsv: that file contains all validated recordings, from which the other splits (train, dev, test) are derived. CV ships validated.tsv because it also releases the invalidated and not-yet-validated recordings (invalidated.tsv and other.tsv).
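If you want to convince yourself of this, here is a quick sanity check, assuming the standard TSV layout with a “path” column:

```python
import pandas as pd

# Load the clip file names listed in each TSV.
paths = {name: set(pd.read_csv(f"{name}.tsv", sep="\t")["path"])
         for name in ("train", "dev", "test", "validated")}

# Every default split is drawn from validated.tsv, and the splits do not overlap.
assert paths["train"] | paths["dev"] | paths["test"] <= paths["validated"]
assert not (paths["train"] & paths["dev"] or paths["train"] & paths["test"] or paths["dev"] & paths["test"])
print("train/dev/test are disjoint and all come from validated.tsv")
```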