Hey @MarkHart, welcome…
Is this still the case in the current version
Yes.
you cannot download the old dataset
Actually, you can. For a while, versions more than two releases old were not available, but they are back. Please check the datasets menu.
nor can you be certain that files in the current test set were not in train at the time of training
The practical implication is that you could use CV for either training or testing, but not both.
Unless you re-train your model…
For such a workflow, one can take the old dataset’s test.tsv, subtract it from the newer version’s validated.tsv, and re-split the remainder into train + dev.
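A rough sketch of that subtract-and-re-split step, in case it helps. It assumes the TSV files have the usual `client_id` and `path` columns. The helper names, the 10% dev share and the speaker-wise split are my own choices for illustration, not what the official splitting does.

```python
import csv
import random
from collections import defaultdict


def read_tsv(path):
    """Load a Common Voice style TSV file into a list of row dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


def subtract_rows(validated, old_test, key="path"):
    """Return the validated rows whose `key` value does not occur in old_test."""
    held_out = {row[key] for row in old_test}
    return [row for row in validated if row[key] not in held_out]


def split_by_speaker(rows, dev_fraction=0.1, seed=42):
    """Split rows into (train, dev) so that no client_id lands in both."""
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["client_id"]].append(row)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    dev_target = int(len(rows) * dev_fraction)
    train, dev = [], []
    for speaker in speakers:
        # Fill dev first, one whole speaker at a time, then send the rest to train.
        bucket = dev if len(dev) < dev_target else train
        bucket.extend(by_speaker[speaker])
    return train, dev


# Hypothetical paths; adjust to wherever the two releases are unpacked.
# old_test = read_tsv("cv-old/test.tsv")
# validated = read_tsv("cv-new/validated.tsv")
# remaining = subtract_rows(validated, old_test)
# train, dev = split_by_speaker(remaining)
```

Subtracting with `key="path"` only removes the old test clips themselves. Passing `key="client_id"` instead would also drop any newer recordings by the old test speakers, which keeps speakers disjoint, assuming the ids stay the same between releases.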
For further fine-tuning the model with the new data, one could use the delta version (available for download). Delta only contains new data, so you can re-split validated as train+dev and test against the older version’s test split. If that makes sense…
In my opinion, with 4 releases a year, many datasets keep growing to some degree, so keeping a fixed test set might not make sense (e.g. its share of the dataset would shrink). Currently, the test set contains recordings from the people who recorded the least, but by the next version they may keep recording and move into dev, or further into train. But since many people come, try, and leave, the test set does not change much.
In any case, for many languages (fewer than 90-100k recordings) the default splits are not enough.