Why does train.tsv include so few files (just 3% of the validated set)?

dataset
#1

Hi! Could you please explain why, in the new CV dataset, train.tsv has just 12136 files (lines) when validated.tsv has 490484 files (lines)? The question is: why is the train set so small when the validated set is so huge? I would think the train set should include 99% of the validated set, shouldn't it?

(LRSaunders) #2

@gweber @kdavis Do either of you have an answer to this?

(kdavis) #3

Which language are you talking about?

(Pedro Lima) #4

@kdavis It’s for English, I’m also curious about this.

(Pedro Lima) #5

@voicecontrib I’m looking at various datasets and it seems to me that’s a pattern, in each language the validated.tsv is bigger than the train.tsv file.

(kdavis) #6

It will always be true that

train.tsv < validated.tsv

as the validated clips have to be distributed among train.tsv, dev.tsv, and test.tsv.

What’s of note is that

train.tsv + dev.tsv + test.tsv < validated.tsv

The reason is that many sentences were read multiple times by different people, and in machine learning you do not want a biased training set, as it will produce a biased model. So repeated readings of a sentence, even though validated, were removed when validated.tsv was distributed among train.tsv, dev.tsv, and test.tsv.
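The dedup-then-split step can be sketched as follows. This is a minimal illustration of the idea, not the actual Common Voice release script; the `sentence` and `path` fields and the toy rows are assumptions for the example, and the split fractions are arbitrary:

```python
import random

def dedup_and_split(rows, seed=42, dev_frac=0.1, test_frac=0.1):
    """Keep one clip per unique sentence, then split into train/dev/test."""
    seen = set()
    unique = []
    for row in rows:
        if row["sentence"] not in seen:
            seen.add(row["sentence"])
            unique.append(row)
    random.Random(seed).shuffle(unique)
    n_dev = int(len(unique) * dev_frac)
    n_test = int(len(unique) * test_frac)
    dev = unique[:n_dev]
    test = unique[n_dev:n_dev + n_test]
    train = unique[n_dev + n_test:]
    return train, dev, test

# Toy validated set: "hello world" was read three times,
# so two of those clips are dropped before splitting.
validated = [
    {"path": "a.mp3", "sentence": "hello world"},
    {"path": "b.mp3", "sentence": "hello world"},
    {"path": "c.mp3", "sentence": "hello world"},
    {"path": "d.mp3", "sentence": "good morning"},
]
train, dev, test = dedup_and_split(validated)
print(len(train) + len(dev) + len(test))  # 2, strictly less than len(validated) == 4
```

Because the duplicates are dropped before splitting, the three output sets together are always smaller than the validated set whenever any sentence was read more than once.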

As an extreme but illustrative example, assume that there was only a single sentence “Hello world” and everyone read this same sentence. So, train.tsv, dev.tsv, and test.tsv would contain only people reading this one sentence.

In this case a “speech-to-text engine” would have perfect accuracy if it just did the following

print("Hello world")

The obvious problem in this case is that such a “speech-to-text engine” would not generalize to be useful for other sentences.

A similar but less extreme case arises with repeated sentences. If there are repeated sentences in the training set, then an engine trained on these repeated sentences will be biased to always try and “hear” these sentences.

However, if no sentences are repeated, as is the case with the Common Voice data set, then an engine trained on such a data set will not be biased and will actually learn to go from speech to text without the biases of the above examples.

But if no sentences are repeated in train.tsv, dev.tsv, and test.tsv created from a validated.tsv with repeated sentences, then

train.tsv + dev.tsv + test.tsv < validated.tsv

#7

Interesting… I was not aware that repeated sentences were filtered out.

That presents a bit of a problem, because the size of the corpus for English was reduced in December 2017 (which is around where we are now in the validation queue) and didn’t really have many big changes until Sentence Collector launched early this year.

So what that means is that we have a year’s backlog of clips to validate that probably won’t even be used.

(kdavis) #8

I’m not sure I understand what you are saying.

Are you saying/claiming that all English clips not marked as valid or invalid, i.e. “other” clips, are repeats of sentences already in the validated data set? If so, do you have numbers to support this?

(kdavis) #9

Oh I should add that validated.tsv contains all the validated clips, including repeats. So you are free to do whatever you want with the repeated clips.

However, as is required for benchmark data sets, of which Common Voice is now one, we have to release a clean, canonical train, dev, test split to allow researchers to benchmark their algorithms, and this is what we have done.

#10

I don’t have data to support it, but I have validated over 80,000 clips and a large number of them were repeats. And that was before the corpus was reduced in size.

But it wouldn’t be difficult for someone to write a query to figure it out for sure.
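Such a query could be sketched in a few lines. The sentence lists here are made-up placeholders; in the real dataset they would come from the `sentence` column of validated.tsv and other.tsv:

```python
# Hypothetical sentence lists; in practice these would be read from the
# tab-separated validated.tsv and other.tsv files of a language.
validated_sentences = ["the cat sat", "hello there", "the cat sat"]
other_sentences = ["hello there", "brand new line", "the cat sat"]

# Count how many not-yet-validated clips repeat an already-validated sentence.
validated_set = set(validated_sentences)
repeats = sum(1 for s in other_sentences if s in validated_set)
print(f"{repeats} of {len(other_sentences)} unvalidated clips repeat a validated sentence")
```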

(Michael Maggs) #11

Oh dear, oh dear. That potentially means that a lot of volunteer time is being used unproductively. When validating recordings I find, like @dabinat, that a significant proportion are repeats (i.e. the same text recorded several times by the same person, or by several different people). It's difficult to guess how many, but I'd say at least 10%. It's a surprise to find that repeated recordings are not wanted, as volunteer readers have been presented with the same sentences to read multiple times since the project started, and that's still happening now (see https://github.com/mozilla/voice-web/issues/1807). Would it be sensible to pause validation of recordings until someone is able to run a script to remove the repeats? If this is indeed an issue, should it be opened on GitHub?

#12

I know that when recording, the site shows sentences with the fewest recordings. I think validation should work in a similar way: prioritize unique recordings above repeats.

(Stefan Falk) #13

@kdavis So if I am getting this right, the train, test, and dev TSV files are so small because there are so many duplicates? This means that only about 5% of validated.tsv made it into the train/test/dev split:

$ cat train.tsv test.tsv dev.tsv | wc -l
26170

$ cat validated.tsv | wc -l
490484

The thing you said about repeated sentences:

A similar but less extreme case arises with repeated sentences. If there are repeated sentences in the training set, then an engine trained on these repeated sentences will be biased to always try and “hear” these sentences.

Do you know how big this danger is? Could you reference any literature? I would have guessed that sentences repeated by different people might contribute some useful variance, leading to a more robust model, given accents, or mistakes that could be corrected by the language model, etc.

Regarding the sentences presented to the readers:

It seems that the text corpus is too small. Wouldn't it be possible to present each sentence just once?

(Jeff Ward) #14

It is my current understanding that the train/dev/test sets are completely regenerated for each release, with no guarantee that the previous split will be preserved, so I would caution against using the released splits as an academic reference. See this thread: Dataset split best practices?

(kdavis) #15

It's obvious the danger is there. How "big" the danger is, however, is hard to quantify, and I don't have specific literature to point to.

With the current release we've been as conservative as possible: no repeats in train.tsv, dev.tsv, or test.tsv. However, we have also included validated.tsv. So if you, or anyone else, want to include repeats in your results, you are free to do so.

I agree with you that some repeats could provide a somewhat more "robust" model. However, as you can see, how many repeats to allow is a grey zone. Are 3 repeats OK, but 4 too many? To avoid that debate altogether we allowed no repeats and included validated.tsv, so people can include as many or as few repeats as they desire.
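Anyone who wants a middle ground between "no repeats" and "all repeats" can derive their own split from validated.tsv. A minimal sketch of capping repeats at N clips per sentence (the `sentence` and `path` fields and the cap of 3 are illustrative assumptions, not part of any official tooling):

```python
from collections import defaultdict

def cap_repeats(rows, max_per_sentence=3):
    """Keep at most `max_per_sentence` clips for any one sentence."""
    counts = defaultdict(int)
    kept = []
    for row in rows:
        if counts[row["sentence"]] < max_per_sentence:
            counts[row["sentence"]] += 1
            kept.append(row)
    return kept

# Toy data: five readings of one sentence, one reading of another.
clips = [{"path": f"{i}.mp3", "sentence": "hello world"} for i in range(5)]
clips.append({"path": "x.mp3", "sentence": "good morning"})
capped = cap_repeats(clips, max_per_sentence=3)
print(len(capped))  # 3 "hello world" clips kept, plus 1 "good morning"
```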

I agree.

It would be, but this is not how the site is currently programmed.

(Stefan Falk) #16

Thank you for your reply @kdavis.

What you say makes sense. In the end I think we might just have to try things and evaluate what works best.

Maybe we will see a larger pool of sentences in the future in order to avoid this danger altogether 🙂