Data distribution among sets

Looking through Mozilla’s GitHub repositories, it seems that the sets are generated in the CorporaCreator, in corpus.py lines 95 through 119. On line 95, we sort by “user_sentence_count”. Then on lines 113 through 119, we bin the beginning of the list into test and dev, and the remainder into train. I would guess this results in test and dev consisting mostly of guest contributors with just a single sentence each, while most registered users end up in train. This doesn’t seem like a good way to do it. Am I right in assuming this is the code actually used for the partitioning?
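For concreteness, here is a minimal sketch of the partitioning as I read it; this is my paraphrase, not the actual corpus.py code, and the fixed bin sizes are simplified placeholders:

```python
import pandas as pd

def partition(df: pd.DataFrame, test_size: int, dev_size: int):
    """Sketch of the split as I read corpus.py: sort clips so that
    users with the fewest sentences come first, slice test and dev
    off the front, and leave the remainder for train."""
    counts = df.groupby("client_id")["sentence"].transform("count")
    ordered = (
        df.assign(user_sentence_count=counts)
          .sort_values("user_sentence_count")
          .drop(columns="user_sentence_count")
    )
    test = ordered.iloc[:test_size]
    dev = ordered.iloc[test_size:test_size + dev_size]
    train = ordered.iloc[test_size + dev_size:]
    return train, dev, test
```

Reading it this way makes the concern obvious: single-sentence guest contributors sort to the front and are consumed entirely by test and dev.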

I still think it would be best to use a hash of the user ID for binning, which would automatically give us minimally changing sets: they would grow along with the data, tending to match the overall demographics, but no users would ever jump from set to set between versions.
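Something along these lines is what I have in mind; the function name, percentages, and hash choice are illustrative, not a concrete implementation proposal:

```python
import hashlib

def bucket(client_id: str, dev_pct: float = 0.05, test_pct: float = 0.05) -> str:
    """Hash-based binning: a stable hash of the user ID maps each
    contributor to a point in [0, 1), and fixed cut points assign all
    of that user's clips to one set. Because the hash depends only on
    the ID, not on the dataset contents, a user never moves between
    sets when a new release adds data."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if point < test_pct:
        return "test"
    if point < test_pct + dev_pct:
        return "dev"
    return "train"
```

Each set then grows roughly in proportion to the whole, and each is a uniform random sample of contributors, so it tends to match the overall demographics.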

Also, regarding
“we probably don’t want the test set to just reflect the demographic balance of the contributors”: I disagree. The basic purpose of this split is to compare model performance across the sets on some metric, most often to detect overfitting, so you want all three sets to be sampled from the same distribution. If you want to do further statistics on particular demographics, rebalance, and so on, you can always work with subsets of these three sets when training a model.
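To illustrate working with subsets, here is a hypothetical rebalancing done on top of an existing split rather than inside it; the column name follows the Common Voice per-clip metadata, while the helper and its target count are made up:

```python
import pandas as pd

def balance_by(df: pd.DataFrame, col: str, per_group: int, seed: int = 0) -> pd.DataFrame:
    """Keep at most per_group rows for each value of col, sampling
    uniformly within each group; rows missing the column are dropped."""
    return (
        df.dropna(subset=[col])
          .groupby(col, group_keys=False)
          .apply(lambda g: g.sample(min(len(g), per_group), random_state=seed))
    )
```

A researcher could apply this to the released test set (e.g. `balance_by(test, "gender", 100)`) to report results on a demographically balanced subset, without anyone needing to change the canonical split.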

In an ideal world, maybe. But in practice test sets are more often used for benchmarking: to determine who gets the best results on a given task on a given dataset.

A benchmark for a given language for ASR should reflect the diversity of the people who speak that language, not just the majority of people who have contributed.

I believe that is one of the aims of the Common Voice dataset, although I could of course be mistaken.

Already the Common Voice project has an enormous potential role in recruiting contributors of diverse genders, ages, and language varieties. The train/dev/test split strikes me as too late a stage to try to force balance, and as falling in the domain of the researchers using the data. In any case, I’ve seen no proposal for rebalancing the sets based on demographics, and that doesn’t seem to be how it’s currently done; rebalancing is not a trivial problem, and there are many ways to do it. I won’t keep insisting, since it is easy to create and share your own split, as is already being done.

In any case, I’ve seen no proposal for rebalancing the sets based on demographics

As far as I understand, this is currently being worked on.