Data distribution among sets

Looking through Mozilla’s GitHub repositories, it seems that the sets are generated in the CorporaCreator, in corpus.py lines 95 through 119. On line 95, we sort by “user_sentence_count”. Then on lines 113 through 119, we bin the beginning of the list into test and dev, and the remainder into train. I would guess this results in test and dev consisting mostly of guest contributors with just a single sentence each, while most registered users end up in train. This doesn’t seem like a good way to do it. Am I right in assuming this is the code actually used for the partitioning?
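For concreteness, here is a minimal sketch of the partitioning as I read it; this is my paraphrase, not the actual corpus.py code, and the fixed bin sizes are simplified placeholders:

```python
import pandas as pd

def partition(df: pd.DataFrame, test_size: int, dev_size: int):
    """Sketch of the split as I read corpus.py: sort clips so that
    users with the fewest sentences come first, slice test and dev
    off the front, and leave the remainder for train."""
    counts = df.groupby("client_id")["sentence"].transform("count")
    ordered = (
        df.assign(user_sentence_count=counts)
          .sort_values("user_sentence_count")
          .drop(columns="user_sentence_count")
    )
    test = ordered.iloc[:test_size]
    dev = ordered.iloc[test_size:test_size + dev_size]
    train = ordered.iloc[test_size + dev_size:]
    return train, dev, test
```

Reading it this way makes the concern obvious: single-sentence guest contributors sort to the front and are consumed entirely by test and dev.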

I still think it would be best to use a hash of the user ID for binning, which would automatically give us minimally changing sets: they would grow along with the data, tending to match the overall demographics, but no users would ever jump from set to set between versions.
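Something along these lines is what I have in mind; the function name, percentages, and hash choice are illustrative, not a concrete implementation proposal:

```python
import hashlib

def bucket(client_id: str, dev_pct: float = 0.05, test_pct: float = 0.05) -> str:
    """Hash-based binning: a stable hash of the user ID maps each
    contributor to a point in [0, 1), and fixed cut points assign all
    of that user's clips to one set. Because the hash depends only on
    the ID, not on the dataset contents, a user never moves between
    sets when a new release adds data."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if point < test_pct:
        return "test"
    if point < test_pct + dev_pct:
        return "dev"
    return "train"
```

Each set then grows roughly in proportion to the whole, and each is a uniform random sample of contributors, so it tends to match the overall demographics.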

Also, regarding
“we probably don’t want the test set to just reflect the demographic balance of the contributors”: I disagree. The basic purpose of this split is to compare model performance across the sets on some metric, most often to detect overfitting, so you want all three sets to be sampled from the same distribution. If you want to do further statistics on particular demographics, rebalance, and so on, you can always work with subsets of these three sets when training a model.
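To illustrate working with subsets, here is a hypothetical rebalancing done on top of an existing split rather than inside it; the column name follows the Common Voice per-clip metadata, while the helper and its target count are made up:

```python
import pandas as pd

def balance_by(df: pd.DataFrame, col: str, per_group: int, seed: int = 0) -> pd.DataFrame:
    """Keep at most per_group rows for each value of col, sampling
    uniformly within each group; rows missing the column are dropped."""
    return (
        df.dropna(subset=[col])
          .groupby(col, group_keys=False)
          .apply(lambda g: g.sample(min(len(g), per_group), random_state=seed))
    )
```

A researcher could apply this to the released test set (e.g. `balance_by(test, "gender", 100)`) to report results on a demographically balanced subset, without anyone needing to change the canonical split.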

In an ideal world, maybe. But in practice test sets are more often used for benchmarking: to determine who gets the best results on a given task on a given dataset.

A benchmark for a given language for ASR should reflect the diversity of the people who speak that language, not just the majority of people who have contributed.

I believe that is one of the aims of the Common Voice dataset, although I could of course be mistaken.

Already the Common Voice project has an enormous potential role in recruiting contributors of diverse genders, ages, and language varieties. The train/dev/test split strikes me as too late a stage to try to force balance, and as falling in the domain of the researchers using the data. In any case, I’ve seen no proposal for rebalancing the sets based on demographics, and that doesn’t seem to be how it’s currently done; rebalancing is not a trivial problem, and there are many ways to do it. I won’t keep insisting, since it is easy to create and share your own split, as is already being done.

In any case, I’ve seen no proposal for rebalancing the sets based on demographics

As far as I understand, this is currently being worked on.