I love this dataset, but I am concerned about one aspect of how the dataset splits are handled. It looks as though the splits that come with the downloads are regenerated each time new data is made available. If this is not the case then please let me know, since this is very important on two fronts.
- It raises contamination issues for models that have already been trained and are being topped off as more data is made available: utts a model was trained on may land in a later release’s validation or test split.
- It makes comparison between different models a challenge, since there are no definitive training/validation sets available. To be more concrete, if I wanted to release a model so that others could build off of it or test against it, I would have to publish exactly the training utts used in order to avoid contaminating validation/test results.
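To make the contamination concern concrete, here is a minimal sketch of the check someone would have to run today before evaluating a published model on a newer release. The function name and the utterance IDs are hypothetical, and it assumes the model author has published the list of training utts:

```python
def contaminated_ids(old_train_ids, new_eval_ids):
    """Return IDs that were in the training set of an older release
    but now sit in the validation/test split of a newer release."""
    return sorted(set(old_train_ids) & set(new_eval_ids))


# Hypothetical utterance IDs, for illustration only.
old_train = ["utt_001", "utt_002", "utt_003"]
new_test = ["utt_003", "utt_104"]

leaked = contaminated_ids(old_train, new_test)
print(leaked)
```

Any non-empty result means the evaluation on the newer split is not clean for that model.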
I see two paths: either preserve splits between revisions, or don’t provide splits in the download and make it clear that there are no official splits.
Beyond the code changes required to the corpuscreator, preserving the splits presents a problem if the current split algorithm is used. Assuming that the dataset grows over many iterations, the current algorithm will disproportionately select from the early utts. This would be fine if all contributed utts were equally representative of all future sentences. However, I highly doubt this condition holds, since language changes over time and, in any long-term dataset, earlier contributions are often significantly different from later ones, if for no other reason than that early adopters are generally outliers. If splits are preserved, I would highly recommend switching to a simple percentage mechanism, with the percentages tied to the assumption that the dataset is expected to grow to at least n utts. I would also create a new field ‘version’ to denote when an utt joined the dataset.
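As a rough sketch of what a percentage mechanism with preserved splits could look like (this is not the corpuscreator's current algorithm), each utt could be assigned to a split by hashing its ID, so the assignment never changes as the dataset grows. The function names, the 5% figures, and the record layout are my assumptions; in practice the percentages would be chosen from the expected final size n.

```python
import hashlib


def assign_split(utt_id, dev_pct=5.0, test_pct=5.0):
    """Deterministically map an utterance ID to a split.

    The same ID always lands in the same split, no matter how many
    other utterances are added later. Percentages are placeholders.
    """
    digest = hashlib.sha256(utt_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    if bucket < test_pct * 100:
        return "test"
    if bucket < (test_pct + dev_pct) * 100:
        return "dev"
    return "train"


def extend_release(previous, new_utt_ids, version):
    """Build the next release: keep every earlier assignment untouched
    and tag newly added utterances with the version they joined in."""
    release = dict(previous)
    for utt_id in new_utt_ids:
        if utt_id not in release:
            release[utt_id] = {"split": assign_split(utt_id), "version": version}
    return release


# Hypothetical IDs: release 1 has 3000 utts, release 2 adds 2000 more.
ids_v1 = [f"utt_{i:06d}" for i in range(3000)]
ids_v2 = [f"utt_{i:06d}" for i in range(3000, 5000)]

release_1 = extend_release({}, ids_v1, version=1)
release_2 = extend_release(release_1, ids_v1 + ids_v2, version=2)
```

Because the split depends only on the ID, every release draws roughly the same percentage from old and new contributions alike, and the ‘version’ field lets anyone reconstruct the exact splits of an earlier release.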
With that being said, I don’t recommend leaving the splits out. This dataset is incredibly important to academic and industry work, and publishing definitive splits will enable both of those areas to collaborate more easily and enhance the value of this dataset.
Is this a concern others have?