Load validated split Hugging Face data?

When loading the data from Hugging Face, it does not seem possible to load the validated split for the Dutch language, as shown in the image below. I use the following lines of code to load the data.

from datasets import load_dataset
load_dataset("mozilla-foundation/common_voice_13_0", "nl", streaming=False)

I would like to load all 86,798 instances that can be downloaded from the Common Voice project itself using load_dataset(), but this does not seem possible. Furthermore, Hugging Face states that the ‘nl’ dataset should have this number of instances in the validated split, yet I cannot seem to load it. When I try this for other languages, there is also no option for a validated split.

I’m sorry that I’m not able to offer support for datasets rehosted by Hugging Face, but they have a similar community forum available, where they might be able to help?

I posted the question there as well, thanks!

Whew! I didn’t want you to feel like you were being sent away, but I wanted to make sure you were asking someone who was best able to help.

Thanks, I saw that those forums respond really quickly, which was nice, but thanks for the help anyway!

Hey @Rik_Raes, I know a bit about HF. I use the HF codebase against datasets I have already downloaded from Common Voice, with my own splits.

I’m not sure I understand your problem/aim, but here is a view into your validated problem:

validated != train + dev + test, because HF only provides the default splits already generated by Common Voice’s (CV) CorporaCreator (CC), which does not use all validated recordings: CV/CC tries to produce splits that are as diverse and unbiased as possible.

If you want to use validated.tsv to create your own splits, you need to download it from CV, do the splitting yourself, and use the HF code to load it from your disk, without HF caching (.cache/huggingface/*)… You also need to write your own conversion routine; last year they added casting to the Audio type.
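The “do the splitting yourself” step could look like the minimal sketch below. This is a hypothetical helper, not CV’s actual algorithm: CorporaCreator is speaker-aware, while a naive random cut like this can leak the same speaker into several splits.

```python
import random

def split_validated(rows, train=0.8, dev=0.1, seed=42):
    """Naively cut the rows of a Common Voice validated.tsv into
    train/dev/test (80/10/10 by default). Hypothetical sketch only:
    it ignores speaker identity, unlike CV's CorporaCreator."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle
    n = len(rows)
    n_train = int(n * train)
    n_dev = int(n * dev)
    return (rows[:n_train],
            rows[n_train:n_train + n_dev],
            rows[n_train + n_dev:])

# Example with placeholder rows standing in for TSV records:
train_rows, dev_rows, test_rows = split_validated(list(range(100)))
print(len(train_rows), len(dev_rows), len(test_rows))  # 80 10 10
```

After writing the three row sets back out as TSV/CSV files, you can load them from disk with the HF `datasets` library and cast the path column to Audio yourself.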

Ref: this was my trial-and-error with this workflow; they added the casting recently, so I can now use it to fine-tune Whisper, for example:

If you have other questions, you may ask privately, or better, there:

I did some more digging on this issue - documented here on the HF forum. The short answer is that the Python layer that HF puts over the top of the CV dataset does not include the validated.tsv data file, and there is therefore no validated split available in the load_dataset() API call.

I can see why developers would want the validated data as something to work with, rather than the default splits. But fixing this sits with Hugging Face.

Is this something that @jesslynnrose would do? Jess, I don’t want to give you more work, but the issue here is with Hugging Face’s implementation of the CV data, not on the CV side.

There may also be a reason why it has been implemented this way that I’m not aware of - for example to force developers to use the default splits.

There may also be a reason why it has been implemented this way that I’m not aware of - for example to force developers to use the default splits.

Why would they force that? To force developers to create mediocre models?

The only reason I can think of is this: the Audio field replaces the mp3 files (the audio data is embedded), so the converted datasets take a lot of space. The default splits contain only a portion of all the validated data, so they consume less.

If they included validated.tsv as a converted/sharded dataset split:

  1. It would be pretty large.
  2. It would include the already existing train/dev/test splits, so a large amount of data would be duplicated.
  3. They would need many more processing cycles for the conversion.

It is certainly related to their Dataset object design, which tries to make usage easier, but given the current CV/CC splits, nobody would get meaningful models.

Unless CV changes the default splitting algorithm :upside_down_face:

Addendum:

I just checked my local sharded data for the different splitting algorithms I prepared for Whisper fine-tuning in Turkish. Many fields were removed, but that metadata does not take much space anyway. Note that train+dev are augmented (so the size is about 1.9x).

  • s1 algorithm (default splits: (train + dev)*2 augmented + test)
    • compressed: 7.95 GB
    • expanded: 84.7 GB
    • prep. dur: 1:01:59
  • v1 algorithm (all validated in an 80-10-10 division, again with augmentation)
    • compressed: 15.2 GB
    • expanded: 158 GB
    • prep. dur: 1:54:00
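A quick sanity check of these numbers (plain arithmetic on the sizes listed above, nothing more) shows the validated-based dataset is roughly twice the size of the default splits, which matches the duplication argument:

```python
# Sizes reported above, in GB.
s1_compressed, s1_expanded = 7.95, 84.7   # s1: default splits, augmented
v1_compressed, v1_expanded = 15.2, 158.0  # v1: all validated, 80-10-10, augmented

# v1 is about twice the size of s1, consistent with the default
# splits using only part of the validated data.
print(f"v1/s1 compressed size ratio: {v1_compressed / s1_compressed:.2f}")  # 1.91

# The expansion factor (expanded/compressed) is similar for both.
print(f"s1 expansion factor: {s1_expanded / s1_compressed:.1f}")  # 10.7
print(f"v1 expansion factor: {v1_expanded / v1_compressed:.1f}")  # 10.4
```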

These numbers give some idea of the reasoning above.

I agree with you @bozden, I cannot see a reason why validated has been removed from Hugging Face. The default splits are problematic - as we’ve seen in many previous discussions…

Thanks for the help guys, @bozden your tips helped me to obtain a solution for this.

Thanks for this amazing information. It helped me a lot.
