Dataset split best practices?

I love this dataset, but I am concerned about an aspect of the dataset split practices. It looks as though the splits that come with the downloads are re-generated as new data is made available. If this is not the case, please let me know, since this is a very important issue on two fronts.

  • It raises contamination issues for models that have already been trained and are being topped off as more data is made available.

  • It makes comparing different models a challenge since there are no definitive training/validation sets available. To be more concrete, if I wanted to release a model so that others could build off of it or test against it, I would now have to publish exactly which training utts were used in order to avoid contaminating validation/test results.

I see two paths: either preserve splits between revisions, or don’t provide splits in the download and make it clear that there are no official splits.

Beyond the code changes required to CorporaCreator, preserving the splits presents a problem if the current split algorithm is used. Assuming that datasets grow over many iterations, the current algorithm will disproportionately select from the early utts. This would be fine if all utts contributed were equally representative of all future sentences. However, I highly doubt this condition holds, since language changes over time and earlier contributions are often significantly different from later contributions in any long-term dataset, if for no other reason than that early adopters are generally outliers. If splits are preserved, I would highly recommend switching to a simple percentage mechanism, with the percentage amounts tied to the assumption that the dataset is expected to grow to at least n utts. I would also create a new field, ‘version’, to denote when an utt joined the dataset.
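To make the percentage idea concrete, here is a minimal sketch, assuming a hypothetical assignment dictionary keyed by utt id and illustrative 10%/10% dev/test fractions (not a proposal for the actual numbers):

```python
import random

# Sketch only: assign newly arrived utterances to fixed-percentage bins and
# record the release version in which each utt joined the dataset.
# The fractions are assumptions chosen against an expected final corpus size.
DEV_FRACTION = 0.10
TEST_FRACTION = 0.10

def assign_new_utts(previous_assignments, new_utt_ids, version, seed=0):
    """previous_assignments: dict utt_id -> (bin, version).
    Returns an updated dict; earlier assignments are never changed."""
    rng = random.Random(seed)
    assignments = dict(previous_assignments)
    for utt_id in new_utt_ids:
        if utt_id in assignments:
            continue  # preserve the split decision from an earlier release
        r = rng.random()
        if r < DEV_FRACTION:
            bin_name = "dev"
        elif r < DEV_FRACTION + TEST_FRACTION:
            bin_name = "test"
        else:
            bin_name = "train"
        assignments[utt_id] = (bin_name, version)
    return assignments
```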

With that being said, I don’t recommend leaving the splits out. This dataset is incredibly important to academic and industry work, and creating published, definitive splits will enable both of those areas to collaborate more easily and enhance the value of this dataset.

Is this a concern others have?


Coming from industry, I do not have that requirement when it comes to reproducibility, but I do understand the need, which is why I’d also welcome some kind of versioning in case of updates. It should always be clear what data has been used for a model in order to allow comparison.

My biggest concern is the fact that the provided train/test/dev split removes approx. 95% of the data, which is very unfortunate. This is also discussed in this post.

My question no. 1 at this point is: if I want to use the entire dataset, is there a chance that my models get biased because of duplicates? Thanks to the provided client_id it is possible to separate speakers and create disjoint sets in that regard, but this does not solve the “duplicate sentence” issue.
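As a rough illustration of the speaker separation mentioned above, a speaker-disjoint split along client_id could be sketched like this (the 80/10/10 fractions and the pandas approach are my assumptions; the file and column names come from the download):

```python
import pandas as pd

# Minimal sketch of a speaker-disjoint split keyed on client_id.
df = pd.read_csv("validated.tsv", sep="\t")

# Shuffle the unique speakers, then cut into assumed 80/10/10 fractions.
speakers = df["client_id"].drop_duplicates().sample(frac=1.0, random_state=42)
n = len(speakers)
train_spk = set(speakers.iloc[: int(0.8 * n)])
dev_spk = set(speakers.iloc[int(0.8 * n): int(0.9 * n)])
test_spk = set(speakers.iloc[int(0.9 * n):])

train = df[df["client_id"].isin(train_spk)]
dev = df[df["client_id"].isin(dev_spk)]
test = df[df["client_id"].isin(test_spk)]
# Note: this keeps speakers disjoint, but the same sentence text can still
# appear in more than one set -- the "duplicate sentence" issue above.
```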

My assumption is that some duplicates should contribute some variance which might be useful for the model to generalize. However, I don’t want to neglect the argument that a model might bias towards repeated sentences and thus suffer in overall performance. If this is a problem for most (or some) models, why was such a small set of sentences used? It should be feasible to present unique sentences, or to fix the number of times a sentence gets presented.


100% agree. Anecdotally, it has been my experience that as total data increases it becomes less of a problem to have repeated similar information (think augments, not different speakers/utts). It may be that as the number of unique sentences and voices grows there should be an allowance in the ‘official’ splits to increase the number of repeated sentences, so long as sentences and voices stay within one set only. So, no voice appears in both train and test, and no sentence appears in both train and test.

  • No careful academic should, or would, “top up” models with additional data from several Common Voice releases. They would train and report results on distinct Common Voice Releases, e.g. 20% WER on Common Voice 2.0. If they “topped up” their models on this type of accumulative data set, it’d be invalid research and would be rejected by reviewers.
  • Quite the contrary, comparing different models is simple. You simply report a result for a given version of Common Voice, e.g. WER on Common Voice 2.0, where there is a canonical train, dev, test split for each language.

PS: Just to check. Have you looked at the data? You say several times there are no definitive splits. But each language ships with a definitive split.

While I agree it’s unfortunate, validated.tsv includes all repeats. So if you want to use them for whatever reason, you can.

There is basically a 100% certainty that your model will become biased if you include all the repeats. However, there is a grey zone in which inclusion of some repeats from validated.tsv may help make your model more robust to speaker variation.

As this is a grey zone (are 3 repeats fine and 4 too many?), we decided to steer clear of ambiguities in the official train, dev, and test split and allow users of the data sets to decide, based on their application, whether including some repeats from validated.tsv is warranted.
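For anyone who wants to experiment in that grey zone, a rough sketch of pulling a limited number of repeats per sentence from validated.tsv, while keeping the official dev/test speakers and sentences out of the extra training data, might look like this (MAX_REPEATS and the pandas approach are assumptions, not a recommendation):

```python
import pandas as pd

MAX_REPEATS = 3  # assumption; this is exactly the ambiguity discussed above

validated = pd.read_csv("validated.tsv", sep="\t")
dev = pd.read_csv("dev.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")

# Exclude any clip whose speaker or sentence appears in the official dev/test.
held_out_speakers = set(dev["client_id"]) | set(test["client_id"])
held_out_sentences = set(dev["sentence"]) | set(test["sentence"])

candidates = validated[
    ~validated["client_id"].isin(held_out_speakers)
    & ~validated["sentence"].isin(held_out_sentences)
]

# Keep at most MAX_REPEATS clips per sentence for the enlarged training set.
extra_train = candidates.groupby("sentence").head(MAX_REPEATS)
```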

I agree, but I doubt if any language of Common Voice is of that size yet.

This seems reasonable, but the question is: How does one quantify this?

For example, how big does the data set have to be to allow for 2 repeats and why is this size big enough to allow for 2 repeats? How big does the data set have to be to allow for 3 repeats and why is this size big enough to allow for 3 repeats? And on and on…

The repeat count is definitely a challenge, and probably model dependent; however, I suspect something of the form ceil(c*log(total sample count)) would be appropriate. There are two approaches to consider here.

First, the ‘official’ sets should make sure that every utt, including ‘other’ and ‘invalid’ ones, is marked as part of a set. The sets should be randomly generated with version information, and splits maintained across releases, with the only guarantee being that either speakers or sentences remain in one bin. Split fractions should be set based on a final expected minimum size of the dataset. From there, tools can be made available to create subsets with different split strategies that maintain the official binning but allow users to add repeats as they see fit. The provided tools could be seed based, so that researchers could publish the seeds and dataset version information used in generating their datasets, which would allow others to reproduce them.

Alternatively, ‘recommended’ additional splits could be provided with one or two strategies built into them. I am personally not a fan of providing ‘recommended’ sets, since maintaining multiple sets would likely become an issue over time. I believe the first strategy would be sufficient and flexible, with the least admin overhead and only a slight inconvenience to a few end users, while still allowing researchers to publish models that hadn’t been contaminated with validation/test data directly.

Given a choice between separating sentences or speakers, I am not sure which would make the most sense, but it probably depends on which is higher: the average number of sentences a user records, or the average number of users a given sentence has been read by. Whichever is higher should likely be the one isolated by set.
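A minimal sketch of both ideas follows; the constant c, the fraction handling, and the function names are purely illustrative assumptions:

```python
import math
import random

# Sketch of the repeat-cap idea: allow more repeats per sentence as the total
# sample count grows, on a log scale. The constant c is an assumption.
def max_repeats(total_sample_count, c=0.5):
    return math.ceil(c * math.log(total_sample_count))

# Seed-based subset creation: a researcher publishes (dataset_version, seed)
# and anyone can regenerate the same subset from the official binning.
def sample_subset(utt_ids, fraction, seed):
    rng = random.Random(seed)
    utt_ids = sorted(utt_ids)  # canonical order so the seed alone reproduces it
    rng.shuffle(utt_ids)
    return utt_ids[: int(fraction * len(utt_ids))]
```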

Why? Writing down a plausible formula isn’t good enough in this case.

Not to be pedantic, but I want to be sure and ask: Have you downloaded the current data set for any language?

Each includes “other” and “invalid” utterances, in addition to “valid” ones, and each includes an official “train, dev, test” split. In addition, as mentioned here, there is an official tool, CorporaCreator, used to make the splits based on data set sizes, which people can use to split the data set in other ways, for example to allow for more repeats.

PS: We separated sentences and speakers; in other words, a sentence is exclusively in either train, dev, or test. Similarly, a speaker is exclusively in either train, dev, or test.

I totally agree that a plausible formula isn’t good enough. That is why I said I wasn’t in favor of providing ‘recommended’ sets in the download.

Yes, I have, several. The header in ‘other.tsv’ and ‘invalid.tsv’ is:

client_id path sentence up_votes down_votes age gender accent

Am I missing where the split is maintained? I have seen references to a ‘clips.tsv’ file that had a field with this information in it, but I don’t see it in any of the language downloads. Is there a different set of downloads I should be looking at than the ones at https://voice.mozilla.org/en/datasets? I think every utt in the invalid, other and validated datasets should be binned. This is because I am in favor of having the end user do the custom split with their own definition of ‘valid’ and number of repeats for speakers/sentences.

Yes, that is a great tool for creating corpora, and I probably should have said ‘modify CorporaCreator’, as I did in the original post.

I understand that is what happens right now. The entire point of this thread was that splits should be preserved across dataset releases, which will, as far as I currently understand the tools and the processes, not happen now. My discussion point was that the ‘official’ splits should contain all the data and that previous ‘official’ split decisions should be honored in future splits. Recommended subsets with specific properties for ‘valid’ and sentence/speaker repeats, like the ones currently found in the downloads as ‘train.tsv’, ‘dev.tsv’, and ‘test.tsv’, should probably be created by end users as they see fit, based on official binning for every utt. This shouldn’t be a problem since the tool to do it already exists and would only require minor modifications to have it honor pre-defined bins. Making tools, like CorporaCreator, available to perform custom splits with specific properties based on the ‘official’ binning is, I think, a good compromise to preserve previous split information.
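As a rough illustration of the kind of check this would enable, here is a sketch that verifies no previously binned utt has changed bins between two releases; the directory layout is hypothetical and the path column is assumed to serve as a stable utt id:

```python
import pandas as pd

# Hypothetical consistency check between two releases: every utt that was
# binned in the previous release must stay in the same bin, unless removed.
def load_bins(release_dir):
    bins = {}
    for split in ("train", "dev", "test"):
        df = pd.read_csv(f"{release_dir}/{split}.tsv", sep="\t")
        for path in df["path"]:
            bins[path] = split
    return bins

old_bins = load_bins("cv_release_old")   # assumed directory names
new_bins = load_bins("cv_release_new")

moved = {
    path: (old_split, new_bins[path])
    for path, old_split in old_bins.items()
    if path in new_bins and new_bins[path] != old_split
}
print(f"{len(moved)} utterances changed bins between releases")
```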

They are free to do this, but this is of little use for comparing results across several algorithms from different groups. There needs to be, and there is, an official train, dev, test split.

I’m not sure why you see having no official split as being useful.

This is unfortunately impossible. The GDPR stipulates that speakers can decide to remove themselves from the data set. If they do so, the split can’t be maintained across releases.

Yep, agreed for academics, but models in industry likely won’t follow this path. Why burn those GPU cycles? And, agreed, they don’t have to maintain these splits, but there is value in respecting a public split even in a private industry model. For instance, industry still wants to compare, even if just internally, to academia and others in industry, even if it isn’t a perfect academic comparison. I agree that this isn’t a make-or-break argument, but instead just a ‘good to have’ argument. Still, a positive for very little, if any, negative is still a positive.

Yes, you are right. Basic research should strive to ensure all models have the same starting point. I will argue two points here. First, from a practical standpoint, it would be nice to maintain previous split information and have a version flag in the data. This minimizes juggling multiple versions of the dataset and the inevitable format changes that are likely to happen over a long-running project. I admit this isn’t a ‘slam dunk’ reason, but ‘practical’ is often the reason a dataset is used in the first place. Second, it is not inconceivable, and actually quite likely, that derivative datasets will be published for things like emotional content. Any change to the splits in this dataset will ripple through those, and untangling datasets between projects could be problematic. Again, a practical consideration and not a make-or-break argument.

The original point is that those splits, as far as I currently understand the tools and processes, will not be maintained as more data is added and the datasets are re-generated and released, so they aren’t ‘definitive’ across releases.

I believe that there are benefits to respecting previous split decisions and that those benefits outweigh the few drawbacks. I agree that the magnitude of the benefits can be debated, but as far as I can tell this is a fairly easy way to add value to an already incredibly valuable dataset.

Just to repeat what I said in another thread, so it’s not lost here. GDPR makes this impossible. Speakers can request their data is removed from the data set, and once this is done maintaining a split across releases is impossible.

Yes the splits are not “definitive across releases”. They are definitive to a release. The GDPR makes such “definitive across releases” splits impossible.

I’m not sure why you think this. Are you referring to different algorithms for the splits? Assuming that is what you mean, I am all in favor of having the default split in CorporaCreator be the current split criteria, but using an official binning for all utts that is maintained in any subset it creates, and not having the default subset splits included in the downloads.

I think what I have been arguing is that there should be an official train, dev and test, but that it should carry forward split decisions from previous releases. If I haven’t been clear on that point, I am sorry. When I say ‘recommended’ splits shouldn’t be included, I am talking about subsets of that larger set. For example, the current train, test and dev make decisions to increase diversity and minimize errors at the cost of volume, leaving many utts in the ‘validated’, ‘invalid’ and ‘other’ sets undeclared. This is an admirable goal, but depending on the use, the requirement for diversity or the tolerance for errors may be higher or lower, so why not leave those decisions up to the end user when they run CorporaCreator or some other tool to create a custom subset?

That is a great point, but in that case the historic datasets will have to be modified as well if they are going to continue to be public. If anything, this is a strong argument to include an ‘errata’ file of utt ids that have been removed and the reason why. It is a lot easier to just make the current version the only available version, with a simple errata list and a version tag, than to maintain multiple dot releases of previous versions.

Historic data sets will no longer be public, only newer data sets with the appropriate removals will be public.

This is actually not a good idea.

The reason someone wanted their data removed could be personal, e.g. they were being stalked. Including such personal information in an official errata would invite further abuse. So we will definitely not be including such information in any release.

There is a difference between a reductive change to a dataset and allowing sets to swap utts. The original argument I posted, and many of my follow-ups, have been concerned with contamination between train/dev/test. Removing an utt doesn’t cause contamination and isn’t as big of a concern.

Again, the decision can still be definitive even if the data isn’t available anymore. The overriding idea is that once an utt makes it into one of the sets it cannot appear in any other set in the future. If an utt is removed, that condition isn’t violated, so there can still be definitive splits across releases.

I had no intention of implying that a personal reason be included. ‘Removed due to terms of service’ or ‘removed due to repeat’, etc., are the types of ‘reasons’ I was implying in the errata list. You don’t even have to say ‘by request of user’; ‘removed due to terms of service’ could be because the utt was found to violate any number of terms, for example because it was in a commercial dataset and included by mistake. An errata list is not an uncommon thing; there are best practices that can be reviewed and implemented. I recommended an errata list in general, and the details of what goes in it can be debated, but I think it is a good idea.
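For illustration only, here is a minimal sketch of what such an errata file could contain and how an end user might apply it; the column names, file names, and example clip ids are all hypothetical:

```python
import pandas as pd

# Hypothetical errata layout: one row per removed utt, with a release tag and
# a non-personal reason (the reasons mirror the examples given above).
errata = pd.DataFrame(
    [
        {"path": "example_clip_000001.mp3", "removed_in": "3.0",
         "reason": "removed due to terms of service"},
        {"path": "example_clip_000002.mp3", "removed_in": "3.0",
         "reason": "removed due to repeat"},
    ]
)
errata.to_csv("errata.tsv", sep="\t", index=False)

# An end user can then drop the affected rows from a local copy of a split.
train = pd.read_csv("train.tsv", sep="\t")
train = train[~train["path"].isin(errata["path"])]
```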

For some languages with smaller data sets there are only a handful of speakers, e.g. Slovenian has 18, and it’s often the case that only through outreach events that data is collected for such languages.

Say the test set for such a language consists of only 2 speakers and they both decide to remove their data, and there are no new contributors to this language before the next release, as there were no new outreach events.

Your suggestion, if I understand it correctly, would be to release the next version of this data set with the test set empty to maintain the split across releases.

To me an empty test set seems not to be a viable solution.

That doesn’t seem to be a likely scenario even with just 18 speakers, but if it is a concern then set a minimum diversity level before any official split is offered, so that the likelihood of it happening is reasonably small. Or accept that in incredibly low-data areas it is possible to have incomplete datasets.

Slovenian has a single person in the training set. So this is not hypothetical.

This would require defining “diversity”. Is it accents? Is it speakers? Is it gender? Is it location? Is it…

Consider defining “diversity”: it would involve repeating the discussion we are having 95 times, or thereabouts, once for each language community, using a team whose members you can count on one hand. We’d be doing nothing else.

We’re not willing to accept this either. It greatly diminishes the value of the data sets and discourages research into these low resource languages, something we explicitly are trying to encourage.

How many GDPR take-down requests happen in a year as a percentage of total users? Now, using that, what is the probability that the one person out of the 18 contributors to this language will request to be removed from the project? I suspect that probability falls into the ‘unlikely’ category.

I believe the context of the discussion above makes it clear that ‘diversity’ meant the number of speakers in the dataset. If it wasn’t clear, I am sorry. But sure, have the discussion about the best way to define it if it isn’t obviously the number of speakers. I suspect that from this point forward new languages and the current low-speaker languages will follow similar user growth and data acquisition curves, so it isn’t wasted effort to discuss how best to integrate and present their data.

This project is an amazing resource that will, hopefully, continue to add data for a long time. I truly appreciate the work that goes into planning it and carrying it out and I know first hand the tremendous value it represents to the field.

At this point I am going to pull out of this discussion. If the idea of preserving split decisions is adopted I am more than willing to join the discussion about the best way to do that. I suspect that if the idea isn’t adopted now it will come back again after users see several more releases happen and start to realize they are not consistent with previous releases.

As a final note, I encourage, at a minimum, the idea of an errata file, if for no other reason than to be clear about the data I as an end user need to remove and why. (Again, personal reasons shouldn’t be in there, just technical or legal ones.) I am personally not 100% versed in GDPR, but I believe there is a requirement to alert organizations that use your data when one of your users opts out, so that they can also remove the data. An errata file is a great, simple mechanism for doing that.