Are splits currently still regenerated per version?

It just came to my attention that the Common Voice splits used to be regenerated for every version. Is this still the case in the current version, and if so, is there any way this can be reconsidered?

The big problem is that once a model is trained on a specific version of Common Voice, there is no way to test that model with Common Voice after a new version is released, as you cannot download the old dataset nor can you be certain that files in the current test set were not in train at the time of training.

Practical implication is that you could use CV for either training or testing, but not both.


Hey @MarkHart, welcome…

Is this still the case in the current version

Yes.

you cannot download the old dataset

Actually, you can. For a while, versions older than the two most recent were not available, but they came back. Please check the datasets menu.

nor can you be certain that files in the current test set were not in train at the time of training
Practical implication is that you could use CV for either training or testing, but not both.

Unless you re-train your model…

For such a workflow, one can take the old dataset’s test.tsv, subtract it from the newer version’s validated.tsv, and re-split the remainder as train + dev.
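Roughly like this in pandas, as a sketch: it assumes the standard Common Voice TSV layout (a path column identifying each clip), that clip filenames stay stable across versions, and placeholder directory names.

```python
# Sketch: remove the old version's test clips from the new validated.tsv,
# so nothing the model may have been tested against leaks into train/dev.
import pandas as pd

old_test = pd.read_csv("cv-old/test.tsv", sep="\t")
new_validated = pd.read_csv("cv-new/validated.tsv", sep="\t")

remaining = new_validated[~new_validated["path"].isin(old_test["path"])]
remaining.to_csv("cv-new/train_dev_pool.tsv", sep="\t", index=False)
```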

For further fine-tuning the model with the new data, one could use the delta version (available for download). The delta only contains new data, so you can re-split its validated.tsv as train + dev and test against the older version’s test split. If that makes sense…
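For the re-splitting part, something as simple as a speaker-level split could do. The 90/10 ratio below is arbitrary, and grouping by client_id is just one way to keep a speaker from ending up in both splits:

```python
# Sketch: re-split a validated.tsv (e.g. from the delta release) into train + dev,
# keeping each speaker (client_id) in only one split. Ratios are illustrative.
import pandas as pd

delta = pd.read_csv("cv-delta/validated.tsv", sep="\t")

speakers = delta["client_id"].drop_duplicates().sample(frac=1.0, random_state=0)
n_dev = max(1, int(0.1 * len(speakers)))
dev_speakers = set(speakers.iloc[:n_dev])

dev = delta[delta["client_id"].isin(dev_speakers)]
train = delta[~delta["client_id"].isin(dev_speakers)]
dev.to_csv("cv-delta/dev.tsv", sep="\t", index=False)
train.to_csv("cv-delta/train.tsv", sep="\t", index=False)
```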

In my opinion, with 4 releases in a year, many datasets grow to some degree, and keeping a steady test set might not be logical (e.g. it would shrink as a percentage of the dataset). Currently, the test set contains the recordings of the people who recorded the least, but in the next version they may continue to record and move into the dev set or further into the train set. But as many people come, try, and go, the test set does not change a lot.

In any case, for many languages (those with fewer than about 90-100k recordings) the default splits are not enough anyway.

Unless you re-train your model…

Which is a problem with open-source models, for example those available on Hugging Face: you don’t always have the ability to retrain the model. Or Common Voice may only be a subset of the total data used, in which case retraining would be rather expensive.

keeping a steady test set might not be logical

I don’t mind steady, but I do mind previous train data getting into dev/test.

Do you have a workable methodology in mind with respect to the splits?

Under the assumption that the datasets on average grow between versions, one could reuse the old splits and then assign the new samples in such a way that the metrics currently used for splitting stay as close to optimal as possible. That way the splits would converge towards the current target metrics while not mixing the splits.
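As a rough sketch of the idea (my own interpretation, not the existing CorporaCreator logic; the target ratios and the greedy rule are placeholders):

```python
# Sketch: keep every clip's old split assignment and greedily assign only the
# new clips towards fixed target ratios. Speaker/sentence constraints that the
# real splitting algorithm enforces are ignored here for brevity.
import pandas as pd

TARGETS = {"train": 0.8, "dev": 0.1, "test": 0.1}  # illustrative ratios

def incremental_split(old_assignments: pd.DataFrame, new_clips: pd.DataFrame) -> pd.DataFrame:
    """old_assignments: columns [path, split]; new_clips: column [path]."""
    counts = old_assignments["split"].value_counts().to_dict()
    total = len(old_assignments)
    rows = []
    for path in new_clips["path"]:
        total += 1
        # Put the clip where the deficit against the target ratio is largest.
        split = max(TARGETS, key=lambda s: TARGETS[s] - counts.get(s, 0) / total)
        counts[split] = counts.get(split, 0) + 1
        rows.append({"path": path, "split": split})
    return pd.concat([old_assignments, pd.DataFrame(rows)], ignore_index=True)
```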

But perhaps I am still missing something. I could spend some free time on programming and testing a new splitting method proposal; do you know if there is any chance it would be picked up and not be wasted time?


But perhaps I am still missing something.

It should work for any language of any size (even a couple of hundred recordings), any demographic distribution (from many people each recording a few clips to a few people recording many), and for any scenario (small changes between versions, big changes, no change, even shrinkage because of deletions, etc.).

do you know if there is any chance

No idea. I had a full-fledged, valid proposal, but it intentionally disregarded your points, as any methodology which tries to keep the test set constant would have a short lifetime.

It’s been a year since we last spoke about it.