Coming from industry, I do not have that reproducibility requirement myself, but I do understand the need, which is why I would also welcome some kind of versioning in case of updates. It should always be clear which data was used to train a model so that results can be compared.
My biggest concern is that the provided train/test/dev split discards approximately 95% of the data, which is very unfortunate. This is also discussed in this post.
My first question at this point: if I want to use the entire dataset, is there a chance that my models become biased because of duplicates? Thanks to the provided client_id it is possible to separate speakers and create disjoint sets in that regard (a quick sketch of what I mean is below), but this does not solve the duplicate-sentence issue.
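For concreteness, here is a minimal sketch of the speaker-disjoint splitting I have in mind, assuming the clips are listed in a validated.tsv with client_id, path and sentence columns (column names taken from a typical Common Voice release; the 80/10/10 proportions are just an example):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Load the full list of validated clips (columns assumed: client_id, path, sentence, ...).
clips = pd.read_csv("validated.tsv", sep="\t")

# First split: 80% train, 20% held out, with no client_id shared between the two.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, holdout_idx = next(splitter.split(clips, groups=clips["client_id"]))
train, holdout = clips.iloc[train_idx], clips.iloc[holdout_idx]

# Second split: divide the held-out part into dev and test, still grouped by speaker.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
dev_idx, test_idx = next(splitter.split(holdout, groups=holdout["client_id"]))
dev, test = holdout.iloc[dev_idx], holdout.iloc[test_idx]

# Speakers are now disjoint across sets, but the same sentence text can still
# occur in both train and test -- this is the duplicate-sentence issue.
overlap = set(train["sentence"]) & set(test["sentence"])
print(f"sentences shared between train and test: {len(overlap)}")
```

This keeps speakers cleanly separated, but as the last lines show, sentence overlap between the sets remains untouched.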
My assumption is that some duplicates contribute variance that might help the model generalize. However, I don't want to dismiss the argument that a model might become biased towards repeated sentences and that this hurts overall performance. If this is a problem for most (or some) models, why was such a small set of sentences used? It should be feasible to present only unique sentences, or to cap the number of times a sentence gets presented (a rough sketch of the capping idea is below).
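On the consumer side, one workaround I'm considering is simply capping the number of recordings kept per sentence before splitting. A rough sketch, again assuming the validated.tsv layout and with max_repeats as a made-up knob:

```python
import pandas as pd

def cap_sentence_repeats(clips: pd.DataFrame, max_repeats: int = 3,
                         seed: int = 42) -> pd.DataFrame:
    """Keep at most `max_repeats` recordings of each sentence,
    shuffling first so we don't always keep the same speakers."""
    return (
        clips.sample(frac=1.0, random_state=seed)  # shuffle rows
             .groupby("sentence")
             .head(max_repeats)                    # keep first N clips per sentence
    )

clips = pd.read_csv("validated.tsv", sep="\t")
capped = cap_sentence_repeats(clips, max_repeats=3)
print(len(clips), "->", len(capped), "clips after capping repeats")
```

Shuffling before the groupby keeps the selection from always favouring the same speakers, but of course this still throws data away rather than fixing the issue on the collection side.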