Load validated split Hugging Face data?

When loading the data from Hugging Face, it does not seem possible to load the validated split for the Dutch language, as shown in the image below. I use the following lines of code to load the data.

from datasets import load_dataset
load_dataset("mozilla-foundation/common_voice_13_0", "nl", streaming=False)

I would like to load all 86,798 instances that can be downloaded from the Common Voice project itself using load_dataset(), but this does not seem possible. Furthermore, Hugging Face states that the ‘nl’ dataset should have this number of instances in the validated split, yet I cannot seem to load it. When I try this for other languages, there is also no option for a validated split.

I’m sorry that I’m not able to offer support for datasets rehosted by Hugging Face, but they have a similar community forum available, where they might be able to help?

I posted the question there as well, thanks!

Whew! I didn’t want you to feel like you were being sent away, but I wanted to make sure you were asking someone who was best able to help.

Thanks, I saw that those forums respond really quickly, which was nice, but thanks for the help anyway!

Hey @Rik_Raes, I know a bit about HF. I use the HF codebase against datasets I have already downloaded from Common Voice, with my own splits.

I’m not sure I understand your problem/aim, but here is a view into your validated problem:

validated != train + dev + test, because HF only provides the default splits already generated by Common Voice’s (CV) CorporaCreator (CC), which does not use all validated recordings: CV/CC tries to produce splits that are as diverse and unbiased as possible.

If you want to use validated.tsv to create your own splits, you need to download it from CV, do the splitting yourself, and use the HF code to load it from your disk, without HF caching (.cache/huggingface/*)… You also need to write your own conversion routine; last year they added casting to the Audio type.
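The “do the splitting yourself” step could look like the minimal sketch below. This is a hypothetical helper, not CV’s actual algorithm: CorporaCreator is speaker-aware, while a naive random cut like this can leak the same speaker into several splits.

```python
import random

def split_validated(rows, train=0.8, dev=0.1, seed=42):
    """Naively cut the rows of a Common Voice validated.tsv into
    train/dev/test (80/10/10 by default). Hypothetical sketch only:
    it ignores speaker identity, unlike CV's CorporaCreator."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle
    n = len(rows)
    n_train = int(n * train)
    n_dev = int(n * dev)
    return (rows[:n_train],
            rows[n_train:n_train + n_dev],
            rows[n_train + n_dev:])

# Example with placeholder rows standing in for TSV records:
train_rows, dev_rows, test_rows = split_validated(list(range(100)))
print(len(train_rows), len(dev_rows), len(test_rows))  # 80 10 10
```

After writing the three row sets back out as TSV/CSV files, you can load them from disk with the HF `datasets` library and cast the path column to Audio yourself.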

Ref: this was my trial-and-error with this workflow; they added the casting recently, so I can now use it to fine-tune Whisper, for example:

If you have other questions, you may ask privately, or better, there:

I did some more digging on this issue - documented here on the HF forum. The short answer is that the Python layer that HF puts over the top of the CV dataset does not include the validated.tsv data file, and there is therefore no validated split available in the load_dataset() API call.

I can see why developers would want the validated data as something to work with, rather than the default splits. But fixing this sits with Hugging Face.

Is this something that @jesslynnrose would do? Jess, I don’t want to give you more work, but the issue here is with Hugging Face’s implementation of the CV data, not on the CV side.

There may also be a reason why it has been implemented this way that I’m not aware of - for example to force developers to use the default splits.

There may also be a reason why it has been implemented this way that I’m not aware of - for example to force developers to use the default splits.

Why would they force that? To force developers to create mediocre models?

The only reason I can think of is this: the Audio field replaces the mp3 files (the audio data is embedded), so the converted datasets take a lot of space. The default splits contain only a portion of all the validated data, so they consume less.

If they included validated.tsv as a converted/sharded dataset split:

  1. It would be pretty large.
  2. It would include the already existing train/dev/test splits, so a large amount of data would be duplicated.
  3. They would need many more processing cycles for the conversion.

It is certainly related to their Dataset object design, which tries to make usage easier, but given the current CV/CC splits, nobody would get meaningful models.

Unless CV changes the default splitting algorithm :upside_down_face:

Addendum:

I just checked my local sharded data for the different splitting algorithms I prepared for Whisper fine-tuning in Turkish. Many fields were removed, but that metadata does not take much space anyway. Note that train+dev are augmented (so the size is about 1.9x).

  • s1 algorithm (default splits: (train + dev)*2 augmented + test)
    • compressed: 7.95 GB
    • expanded: 84.7 GB
    • prep. dur: 1:01:59
  • v1 algorithm (all validated in an 80-10-10 division, again with augmentation)
    • compressed: 15.2 GB
    • expanded: 158 GB
    • prep. dur: 1:54:00
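A quick sanity check of these numbers (plain arithmetic on the sizes listed above, nothing more) shows the validated-based dataset is roughly twice the size of the default splits, which matches the duplication argument:

```python
# Sizes reported above, in GB.
s1_compressed, s1_expanded = 7.95, 84.7   # s1: default splits, augmented
v1_compressed, v1_expanded = 15.2, 158.0  # v1: all validated, 80-10-10, augmented

# v1 is about twice the size of s1, consistent with the default
# splits using only part of the validated data.
print(f"v1/s1 compressed size ratio: {v1_compressed / s1_compressed:.2f}")  # 1.91

# The expansion factor (expanded/compressed) is similar for both.
print(f"s1 expansion factor: {s1_expanded / s1_compressed:.1f}")  # 10.7
print(f"v1 expansion factor: {v1_expanded / v1_compressed:.1f}")  # 10.4
```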

These numbers give some idea of the reasoning above.

I agree with you @bozden, I cannot see a reason why validated has been removed from Hugging Face. The default splits are problematic - as we’ve seen in many previous discussions…

Thanks for the help guys, @bozden your tips helped me to obtain a solution for this.

Thanks for this amazing information. It helped me a lot.
