Why does train.tsv include so few files (just 3% of the validated set)?

Hi! Could you please explain why, in the new CV dataset, train.tsv has just 12136 files (lines) while validated.tsv has 490484 files (lines)? The question is: why is the train set so small when the validated set is so huge? I would have thought the train set should include 99% of the validated set, shouldn’t it?

@gregor @kdavis Do either of you have an answer to this?

Which language are you talking about?

@kdavis It’s for English, I’m also curious about this.

@voicecontrib I’m looking at various datasets and it seems to me that’s a pattern: in each language, validated.tsv is bigger than train.tsv.

It will always be true that

train.tsv < validated.tsv

as the validated clips have to be distributed among train.tsv, dev.tsv, and test.tsv.

What’s of note is that

train.tsv + dev.tsv + test.tsv < validated.tsv

The reason is that many sentences were read multiple times by different people, and in machine learning you do not want a biased training set, or you will end up with a biased model. So repeats of read sentences, even though validated, were removed when validated.tsv was distributed among train.tsv, dev.tsv, and test.tsv.

As an extreme but illustrative example, assume that there was only a single sentence “Hello world” and everyone read this same sentence. So, train.tsv, dev.tsv, and test.tsv would contain only people reading this one sentence.

In this case a “speech-to-text engine” would have perfect accuracy if it just did the following

print("Hello world")

The obvious problem in this case is that such a “speech-to-text engine” would not generalize to be useful for other sentences.

A similar but less extreme case arises with repeated sentences. If there are repeated sentences in the training set, then an engine trained on these repeated sentences will be biased to always try and “hear” these sentences.

However, if no sentences are repeated, as is the case with the Common Voice data set, then an engine trained on such a data set will not be biased and will actually learn to go from speech to text without the biases of the above examples.

But if no sentences are repeated in train.tsv, dev.tsv, and test.tsv created from a validated.tsv with repeated sentences, then

train.tsv + dev.tsv + test.tsv < validated.tsv
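
Roughly, the idea looks something like this (a minimal sketch in Python, not the actual Common Voice tooling; it assumes validated.tsv has a “sentence” column and ignores details such as balancing speakers or clip lengths):

import csv
import random

# Sketch only, NOT the real split code: keep one clip per unique
# sentence, then divide those clips into train/dev/test so that no
# sentence text appears more than once across the three files.
random.seed(0)

with open("validated.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    fieldnames = reader.fieldnames
    seen, unique_rows = set(), []
    for row in reader:
        if row["sentence"] not in seen:
            seen.add(row["sentence"])
            unique_rows.append(row)

random.shuffle(unique_rows)
n = len(unique_rows)
splits = {
    "train_nodup.tsv": unique_rows[:int(0.8 * n)],
    "dev_nodup.tsv": unique_rows[int(0.8 * n):int(0.9 * n)],
    "test_nodup.tsv": unique_rows[int(0.9 * n):],
}
for name, rows in splits.items():
    with open(name, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)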

Interesting… I was not aware that repeated sentences were filtered out.

That presents a bit of a problem, because the English sentence corpus was reduced in size in December 2017 (which is around where we are now in the validation queue) and didn’t really see many big changes until Sentence Collector launched early this year.

So what that means is that we have a year’s backlog of clips to validate that probably won’t even be used.

I’m not sure I understand what you are saying.

Are you saying/claiming that all English clips not marked as valid or invalid, i.e. “other” clips, are repeats of sentences already in the validated data set? If so, do you have numbers to support this?

Oh I should add that validated.tsv contains all the validated clips, including repeats. So you are free to do whatever you want with the repeated clips.

However, as is required for benchmark data sets, of which Common Voice is now one, we have to release a clean, canonical train, dev, test split to allow researchers to benchmark their algorithms, and this is what we have done.

I don’t have data to support it, but I have validated over 80,000 clips and a large number of them were repeats. And that was before the corpus was reduced in size.

But it wouldn’t be difficult for someone to write a query to figure it out for sure.
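
Something along these lines would give a rough answer (a hedged sketch using pandas; it assumes the release also ships an other.tsv with the not-yet-decided clips and that both files have a “sentence” column):

import pandas as pd

# How many not-yet-validated ("other") clips are readings of sentences
# that already appear in the validated set?
validated = pd.read_csv("validated.tsv", sep="\t")
other = pd.read_csv("other.tsv", sep="\t")

already_validated = set(validated["sentence"])
repeats = other["sentence"].isin(already_validated)
print(f"{repeats.sum()} of {len(other)} pending clips "
      f"({repeats.mean():.1%}) repeat an already-validated sentence")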

Oh dear, oh dear. That potentially means that a lot of volunteer time is being used unproductively. When validating recordings I find, like @dabinat, that a significant proportion are repeats (i.e. the same text recorded several times by the same person, or by several different people). It’s difficult to guess how many, but I’d say at least 10%. It’s a surprise to find that repeated recordings are not wanted, as volunteer readers have been presented with sentences to read multiple times since the project started, and that’s still happening now (see https://github.com/mozilla/voice-web/issues/1807).

Would it be sensible to pause the validation of recordings until someone’s able to run a script to remove the repeats? And if this is indeed an issue, should it be opened on GitHub?

I know that when recording, the site shows sentences with the fewest recordings. I think validation should work in a similar way: prioritize unique recordings above repeats.

@kdavis So if I am getting this right, the train, test, and dev tsv files are so small because there are so many duplicates? This means that only 5% of validated.tsv made it into the train/test/dev split:

$ cat train.tsv test.tsv dev.tsv | wc -l
26170

$ cat validated.tsv | wc -l
490484
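
One way to sanity-check that reading (a hedged sketch; the “sentence” column name is assumed) is to count the unique sentences in validated.tsv, since the split should then contain at most one clip per sentence:

import pandas as pd

# If repeats are dropped before splitting, the combined size of
# train/dev/test should be close to the number of unique sentences.
validated = pd.read_csv("validated.tsv", sep="\t")
print(len(validated), "validated clips")
print(validated["sentence"].nunique(), "unique sentences")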

The thing you said about repeated sentences:

A similar but less extreme case arises with repeated sentences. If there are repeated sentences in the training set, then an engine trained on these repeated sentences will be biased to always try and “hear” these sentences.

Do you know how big this danger is? Could you reference any literature? I would have guessed that repeated sentences read by different people might contribute some variance which could be useful for a more robust model, considering accents, or mistakes which could be auto-corrected by the language model, etc.

Regarding the sentences presented to the readers:

It seems that the text corpus is too small. Wouldn’t it be possible to present each sentence just once?

It is my current understanding that the train/dev/test sets are completely re-generated for each release, with no guarantee that the previous split will be preserved, so I would caution against using the released splits as an academic source. See this thread: Dataset split best practices?

It’s obvious the danger is there. How “big” the danger is, though, is hard to quantify, and I don’t have specific literature to point to.

With the current release we’ve been as conservative as possible: no repeats in train.tsv, dev.tsv, and test.tsv. However, we have also included validated.tsv. So if you, or anyone else, want to include repeats in your results, you are free to do so.

I agree with you that some repeats might provide a somewhat more “robust” model. However, as you can clearly see, how many repeats to allow is a grey area. Are 3 repeats OK, but 4 too many? To avoid that debate altogether we allowed no repeats and included validated.tsv, so that people can include as many or as few repeats as they like.
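
So, for example, anyone who believes a few repeats help robustness can build their own training list from validated.tsv and cap repeats at whatever threshold they prefer (a hedged sketch; the “sentence” column name is assumed):

import pandas as pd

# Keep at most MAX_REPEATS clips per sentence; the threshold itself is
# exactly the grey-area choice left to each researcher.
MAX_REPEATS = 3

validated = pd.read_csv("validated.tsv", sep="\t")
capped = validated.groupby("sentence").head(MAX_REPEATS)
capped.to_csv("train_custom.tsv", sep="\t", index=False)
print(f"kept {len(capped)} of {len(validated)} validated clips")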

I agree.

It would, but this is not how the site is currently programmed.

Thank you for your reply @kdavis.

What you say makes sense. In the end I think we might just have to try things out and evaluate what works best.

Maybe we will see a larger pool of sentences in the future, in order to avoid this danger altogether 🙂

I stumbled on this and it is kind of concerning, given the amount of effort that could be going to waste.

Can I check a couple of points (in case I’ve got them wrong):

  1. If a particular speaker submits a repeat of a sentence they’ve recorded before, it overwrites their earlier recording (since both get the same sentence hash). This always seemed wasteful, as the earlier submission is lost. I may have misinterpreted the code, but I believe this is how it works, and I observed it early on with my locally hosted version.
  2. Based on @kdavis’s point above about using a sentence just once within test/dev/train, only a single person’s recording (out of every recording of that sentence submitted by anyone) will get put into test/dev/train. This isn’t quite as bad as 1 above, as the recording is at least kept in validated.tsv, so it may still be used for scenarios where the duplicate issue isn’t a problem (see the sketch just after this list for how to pull those unused clips out).
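
On point 2, the unused repeats are easy to recover from the release itself, something like this (a hedged sketch; it assumes the “path” column uniquely identifies a clip):

import pandas as pd

# Find the validated clips that did not make it into the official
# train/dev/test split, i.e. the extra readings of repeated sentences.
validated = pd.read_csv("validated.tsv", sep="\t")
split = pd.concat(
    pd.read_csv(name, sep="\t") for name in ("train.tsv", "dev.tsv", "test.tsv")
)
unused = validated[~validated["path"].isin(split["path"])]
print(f"{len(unused)} validated clips are not in the official split")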

The concern I have is that this could put people off contributing if they see that so much of what they do won’t end up training a speech recognition model.

So far as I can see, the only way around the waste (as far as speech recognition is concerned) would be to have a super diverse set of sentences and have each one used just once.
Or to figure out some way to ensure that having a certain number of duplicates didn’t matter: presumably if every sentence came up twice, they’d all have the same ratio and the model wouldn’t be biased into returning any particular result.

What we are trying to solve now is making sure most languages have enough sentences so participants will never be presented with something that has already been recorded.

Once we have that, I think we will mitigate most of the concerns I’m reading here.

I’m still seeing some duplicates being recorded in English, but the number is considerably smaller now that we have the wiki sentences. I don’t really consider it an issue anymore.

The bigger concern for me is how many duplicates were recorded in the past. Not knowing that number makes it hard to tell exactly how close we are to the goal of 10,000 hours. It could be a lot further away than the counter on the website suggests.

FWIW, the effort wasn’t a complete waste, since appropriate datasets depend a bit on the end goal of the model. Working on production ML, I’m mostly concerned about the /test/ set being well segregated from the rest of the data (especially being sure that the test speakers are wholly distinct from the train speakers), but, in my particular area, I worry less about repeated sentences in the train set.