Looking through Mozilla’s GitHub repositories, it seems the sets are generated by the CorporaCreator, in corpus.py lines 95 through 119. Line 95 sorts by “user_sentence_count”; then lines 113 through 119 bin the beginning of the sorted list into test and dev, and the remainder into train. I would guess this puts mostly guest contributors with a single sentence into test and dev, while most registered users end up in train. That doesn’t seem like a good way to partition. Am I right in assuming this is the code actually used for the partitioning?
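To make sure I’m reading that logic correctly, here is my paraphrase of what I think those lines do — the function and variable names are mine, not the actual CorporaCreator code:

```python
# Paraphrase of my reading of corpus.py lines 95-119 (names are mine,
# not Mozilla's): sort ascending by per-user sentence count, then slice
# the head of the list into test and dev, and put the rest in train.
def partition(rows, dev_size, test_size):
    # Users with the fewest sentences (often one-off guest contributors)
    # sort to the front of the list...
    rows = sorted(rows, key=lambda r: r["user_sentence_count"])
    # ...and are therefore the ones binned into test and dev.
    test = rows[:test_size]
    dev = rows[test_size:test_size + dev_size]
    train = rows[test_size + dev_size:]
    return train, dev, test
```

If that reading is right, the test and dev sets are systematically biased toward low-volume contributors rather than being a representative sample.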
I still think it would be best to use a hash of the user ID for binning, which would automatically give us minimally changing sets: they would grow along with the data, tending to match the overall demographics, but no users would ever jump from set to set between versions.