How are the dev/test/train datasets split?

I have surveyed a few of the languages, and one thing I have not found is a good description of how the dev/train/test sets are split. I would be interested to know what the algorithm is. For instance, it looks like early on, when there isn't much data in a language, the test and dev sets each get nearly as large a share of the data as the train set, but as the data grows their share drops and train gets the majority. Is this the case? If so, what are the percentages/cutoffs?

Heya, there’s some info on that in the CorporaCreator repo: https://github.com/mozilla/CorporaCreator

Thanks, that helps, although I am still a little confused. I have a question and a specific concern. First, the question: these datasets are growing over time, so the sets are re-generated periodically. Are they re-generated in a way that maintains the previous splits? This is important for obvious reasons, but I don't see the re-split strategy explicitly stated anywhere. The concern is that, assuming previous splits are maintained and only added to, the strategy appears to heavily favor the initial samples, since the sets were first generated when the datasets were still small. If the nature of the utterances doesn't change over time this is fine, but it has been my experience that utterances like these can change, sometimes considerably, over time. Is this being addressed? I know I can generate my own splits from the data, but a large public dataset is an amazing resource to compare against with others, so I am trying to avoid making my own splits to ensure any potential future comparisons are done on the same data.

They will be re-generated using the CorporaCreator, which will not maintain the previous splits.

Too bad, that definitely presents a challenge when using this dataset for comparisons against other public models or for long-term training. Knowing that, it is probably best not to use the splits that are part of the available download. Thanks for clarifying.
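For anyone who ends up in the same situation, one way to get splits that stay stable across corpus re-generations is to bucket by a hash of the speaker id rather than by position in the file. This is just a sketch of that idea, not how the CorporaCreator works; the `client_id` column name and the 80/10/10 percentages are assumptions you would adapt to your own copy of the data:

```python
import hashlib

def bucket(client_id: str, dev_pct: float = 0.10, test_pct: float = 0.10) -> str:
    """Assign a speaker to train/dev/test deterministically.

    Hashing the speaker id means a speaker lands in the same bucket
    every time the corpus is re-generated, so the split stays stable
    as new clips are added, and no speaker ever leaks across sets.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    frac = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    if frac < test_pct:
        return "test"
    if frac < test_pct + dev_pct:
        return "dev"
    return "train"

def split_clips(rows):
    """Group clip rows (dicts with a 'client_id' key) by bucket."""
    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        splits[bucket(row["client_id"])].append(row)
    return splits
```

Because the assignment depends only on the id, re-running this over a larger dump later only ever adds new clips to each set; nothing moves between train, dev, and test.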