Are splits currently still regenerated per version?

MarkHart · June 26, 2023, 4:57pm

It just came to my attention that the Common Voice splits used to be regenerated every version. Is this still the case in the current version? If so, is there any way this can be reconsidered?

The big problem is that once a model is trained on a specific version of Common Voice, there is no way to test that model with Common Voice once a new version is created, as you cannot download the old dataset nor can you be certain that files in the current test set were not in train at the time of training.

Practical implication is that you could use CV for either training or testing, but not both.

bozden · June 27, 2023, 1:04am

Hey @MarkHart, welcome…

Is this still the case in the current version

Yes.

you cannot download the old dataset

Actually, you can. For a while, dataset versions more than two releases old were not available, but they came back. Please check the datasets menu.

nor can you be certain that files in the current test set were not in train at the time of training
Practical implication is that you could use CV for either training or testing, but not both.

Unless you re-train your model…

For such a workflow, one can take the old dataset’s test.tsv, subtract it from the newer version’s validated.tsv, and re-split the remainder into train + dev.
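A minimal sketch of that subtraction, assuming the TSVs carry `client_id` and `path` columns and that clip file names stay the same across versions. The function names, the 10% dev fraction, and the choice to also drop every clip from a speaker who appears in the old test set (to keep the splits speaker-disjoint) are illustrative assumptions, not the official Common Voice splitting procedure:

```python
import csv
import random


def read_tsv(path):
    """Read a Common Voice style TSV into a list of dict rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


def resplit_around_old_test(new_validated, old_test, dev_fraction=0.1, seed=0):
    """Keep the old test set frozen and split the rest into train/dev by speaker."""
    test_paths = {row["path"] for row in old_test}
    test_speakers = {row["client_id"] for row in old_test}

    # Subtract the old test clips. Assumption: also drop newer clips from
    # test speakers so that no speaker is shared between test and train/dev.
    remaining = [
        row for row in new_validated
        if row["path"] not in test_paths and row["client_id"] not in test_speakers
    ]

    # Assign whole speakers to dev so train and dev stay speaker-disjoint.
    speakers = sorted({row["client_id"] for row in remaining})
    random.Random(seed).shuffle(speakers)
    dev_speakers = set(speakers[: int(len(speakers) * dev_fraction)])

    train = [row for row in remaining if row["client_id"] not in dev_speakers]
    dev = [row for row in remaining if row["client_id"] in dev_speakers]
    return train, dev
```

Usage would be along the lines of `train, dev = resplit_around_old_test(read_tsv("new/validated.tsv"), read_tsv("old/test.tsv"))`, with the old test.tsv kept as the evaluation set; the paths here are placeholders.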

For further fine-tuning the model with the new data, one could use the delta version (available for download). Delta only contains new data, so you can re-split validated as train+dev and test against the older version’s test split. If that makes sense…

In my opinion, with 4 releases a year, many datasets grow to some degree, and keeping a steady test set might not be logical (e.g. its share of the dataset would keep shrinking). Currently, the test set contains recordings from the people who have recorded the least, but in the next version they can continue to record and move into dev, or further into train. But as many people come, try, and go, the test set does not change a lot.

In any case, for many languages (<90-100k recordings) the default splits are not enough anyway.

MarkHart · June 27, 2023, 5:29am

Unless you re-train your model…

That is a problem with open-source models, for example those available on Hugging Face: you don’t always have the ability to retrain the model. The same applies when Common Voice is only a subset of the total training data, so that retraining would be rather expensive.

keeping a steady test set might not be logical

I don’t mind steady, but I do mind previous train data getting into dev/test.
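To put a number on that concern, one could compare the train split of the version a model was trained on with the dev/test split of a newer version. This is only a sketch: `find_leaks` is a hypothetical helper, and it assumes rows with `client_id` and `path` fields whose values are stable across versions.

```python
def find_leaks(old_train, new_eval):
    """Report clips and speakers of new_eval that were already in old_train."""
    old_paths = {row["path"] for row in old_train}
    old_speakers = {row["client_id"] for row in old_train}

    # Clips that the model has literally seen during training.
    leaked_clips = sorted(
        row["path"] for row in new_eval if row["path"] in old_paths
    )
    # Speakers the model has seen, even if the specific clip is new.
    leaked_speakers = sorted(
        {row["client_id"] for row in new_eval if row["client_id"] in old_speakers}
    )
    return leaked_clips, leaked_speakers
```

An empty result for both lists would suggest the newer test set is safe to use for that model; anything else would need to be filtered out before evaluation.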

bozden · June 27, 2023, 5:40pm

Do you have a workable methodology in mind for the splits?

MarkHart · June 27, 2023, 6:16pm

Under the assumption that the datasets on average grow between versions, one could reuse the old splits and then divide the new samples in such a way that the metrics currently used for splitting are as close to optimal as possible. That way, datasets would converge towards the current metrics while not mixing the splits.
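A rough sketch of that idea, not a worked-out proposal: speakers who already have a split keep it, and each new speaker goes to whichever split is furthest below a target share of clips. The 80/10/10 targets stand in for "the metrics currently used for splitting" and are purely an assumption, as are the function name and the `client_id`/`path` fields.

```python
from collections import Counter, defaultdict


def extend_splits(old_assignment, new_rows, targets=None):
    """Return speaker -> split, keeping old assignments and placing new speakers.

    old_assignment: dict mapping client_id to "train", "dev" or "test".
    new_rows: rows of the new validated set (dicts with "client_id", "path").
    """
    # Hypothetical target shares of clips per split.
    targets = targets or {"train": 0.8, "dev": 0.1, "test": 0.1}
    assignment = dict(old_assignment)  # never move a previously assigned speaker

    clips_per_speaker = defaultdict(int)
    for row in new_rows:
        clips_per_speaker[row["client_id"]] += 1

    # Current split sizes, counting all clips of already assigned speakers.
    sizes = Counter()
    for speaker, count in clips_per_speaker.items():
        if speaker in assignment:
            sizes[assignment[speaker]] += count

    # Place new speakers, largest first, into the most under-filled split.
    new_speakers = sorted(
        (s for s in clips_per_speaker if s not in assignment),
        key=lambda s: (-clips_per_speaker[s], s),
    )
    for speaker in new_speakers:
        total = sum(sizes.values()) + clips_per_speaker[speaker]
        split = max(targets, key=lambda name: targets[name] - sizes[name] / total)
        assignment[speaker] = split
        sizes[split] += clips_per_speaker[speaker]
    return assignment
```

Because old speakers never move, a clip that was in train for one version cannot show up in dev or test later; the trade-off is that the split proportions only drift towards the targets as new speakers arrive.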

But perhaps I am still missing something. I could spend free time on programming and testing a newly proposed method; do you know if there is any chance it would be picked up and not be wasted time?

bozden · June 27, 2023, 6:39pm

But perhaps I am still missing something.

It should work for any language of any size (even a couple hundred recordings), any demographic distribution (from many diverse people recording a few clips each to a few people recording many), and any scenario (small changes, big changes, no change, even a dataset getting smaller because of deletions, etc.)…

do you know if there is any chance

No idea. I had a full-fledged, valid proposal that intentionally disregarded your points, as any methodology which tries to keep the test set constant would have a short lifetime.

It’s been a year since we last spoke about it.
