There may also be a reason why it has been implemented this way that I’m not aware of - for example to force developers to use the default splits.
Why would they force that? To force developers to create mediocre models?
The only thing I can think of is this: the Audio field replaces the mp3 files (the audio data is embedded), so these datasets take a lot of space. The default splits only use a part of the validated recordings, so they consume less.
If they included validated.tsv as a converted/sharded dataset split:
- It would be pretty large.
- It would include the already existing train/dev/test splits, so a large amount of data would be duplicated.
- The conversion would need many more processing cycles.
It is certainly related to their Dataset object design, which tries to make usage easier, but given the current CV/CC splits, nobody would get meaningful models - unless CV changes the default splitting algorithm.
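For anyone who wants to go beyond the default splits anyway, here is a minimal sketch of how one could build a custom 80-10-10 division from a locally downloaded CV release with the datasets library. The directory layout, column name and sampling rate are my assumptions for illustration, not anything the official loader does:

```python
# Rough sketch (not the official CV loader): build a custom 80-10-10 split
# from a local Common Voice download instead of using the default splits.
import os
from datasets import load_dataset, Audio

CV_DIR = "cv-corpus/tr"  # hypothetical local extract with validated.tsv + clips/

# Load the whole validated set from the TSV (metadata only at this point)
ds = load_dataset(
    "csv",
    data_files=os.path.join(CV_DIR, "validated.tsv"),
    delimiter="\t",
    split="train",
)

# Point the 'path' column at the actual mp3 files and embed the audio
ds = ds.map(lambda row: {"path": os.path.join(CV_DIR, "clips", row["path"])})
ds = ds.cast_column("path", Audio(sampling_rate=16_000))

# Naive 80-10-10 division (roughly the "v1 algorithm" I mention in the addendum below)
split = ds.train_test_split(test_size=0.2, seed=42)
dev_test = split["test"].train_test_split(test_size=0.5, seed=42)
train, dev, test = split["train"], dev_test["train"], dev_test["test"]
print(len(train), len(dev), len(test))
```

A real re-split should also keep the same speaker (and ideally the same sentence) out of different splits; the naive random split above does not, and that is exactly the kind of thing splitting algorithms differ in.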
Addendum:
I just checked my local sharded data for the different splitting algorithms I prepared for whisper fine-tuning for Turkish. Many fields were removed, but that metadata does not take much space anyway… Note that train+dev are augmented, so the size is about x1.9.
- s1 algorithm (default splits: (train + dev) x2 augmented + test)
  - compressed: 7.95 GB
  - expanded: 84.7 GB
  - prep. duration: 1:01:59
- v1 algorithm (all validated recordings in an 80-10-10 division, again with augmentation)
  - compressed: 15.2 GB
  - expanded: 158 GB
  - prep. duration: 1:54:00
These numbers should give some idea of the reasoning I gave above.
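As a rough back-of-the-envelope check on those compressed vs. expanded figures (the bitrate and format below are my assumptions, not measured from the corpus): decoding mp3 clips into 16 kHz float32 arrays blows the data up by roughly an order of magnitude, which is in the same ballpark as the ~10x difference between the compressed and expanded sizes above.

```python
# Back-of-the-envelope estimate of the mp3 -> decoded-array blow-up.
# Assumed values, chosen only for illustration: average mp3 bitrate and
# the 16 kHz float32 arrays a typical whisper preprocessing pipeline keeps.
MP3_KBPS = 48            # assumed average clip bitrate
SAMPLE_RATE = 16_000     # Hz
BYTES_PER_SAMPLE = 4     # float32

mp3_bytes_per_sec = MP3_KBPS * 1000 / 8             # 6,000 B/s
raw_bytes_per_sec = SAMPLE_RATE * BYTES_PER_SAMPLE  # 64,000 B/s

print(f"expansion factor: ~{raw_bytes_per_sec / mp3_bytes_per_sec:.1f}x")
```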