Common Voice v1 corpus design problems, overlapping train/test/dev sentences

(milde) #1

Thanks for making Common Voice! I’m a speech researcher and I want to use it for my experiments. Unfortunately, there is a big problem with how the corpus (v1), and specifically the train/dev/test split, is designed. This issue is also being discussed here: (Commonvoice results misleading, complete overlap of train/dev/test sentences #2141)

There is a huge overlap of train/dev/test sentences:

unique sentences in train: 6994
unique sentences in dev: 2410
unique sentences in test: 2362
common sentences train/dev (overlap) = 2401
common sentences train/test (overlap) = 2355
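The counts above can be reproduced with a few lines of Python. This is a hedged sketch: it assumes the v1 release CSV files (e.g. `cv-valid-train.csv`) with a `text` column holding the prompt sentence; adjust the file names and column name to match your copy of the corpus.

```python
import csv
import os

def sentences(path):
    """Return the set of unique prompt sentences in one split file."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["text"].strip() for row in csv.DictReader(f)}

# File names are assumptions based on the v1 release layout.
paths = ["cv-valid-train.csv", "cv-valid-dev.csv", "cv-valid-test.csv"]

if all(os.path.exists(p) for p in paths):
    train, dev, test = (sentences(p) for p in paths)
    print("unique sentences in train:", len(train))
    print("unique sentences in dev:  ", len(dev))
    print("unique sentences in test: ", len(test))
    print("common sentences train/dev: ", len(train & dev))
    print("common sentences train/test:", len(train & test))
```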

Ideally, there should be no overlap at all. Also, there are not that many unique sentences overall, which makes the problem worse. I suggest recording some new sentences and building new dev and test sets with no overlap.

Otherwise, any WER numbers reported on this train/dev/test split are unfortunately pretty much meaningless and encourage absolute overfitting on the training data. Also, the best results are obtained when the language model is trained only on the train sentences, without any other sentences. This is pretty much what the Kaldi recipe does now, and it encourages recognizing only these ~7000 sentences and nothing else. The reported 4% WER highlights the problem. This is surely not the intention of the Common Voice corpus.

Let me know if this is the right place to address these concerns.

(Michael Henretty) #2

(Copying my response from the github issue)

Hi @bmilde,

First of all, thank you for reporting this bug. It is indeed very critical. One of the reasons we wanted to release this data so quickly was to get this kind of feedback from people like you, so bravo!

I have spoken about this split with our machine learning group (which is separate from the Common Voice team that I am part of), and there are a couple of solutions we are investigating.

One, we can redo the split between dev/train/test to make sure there are no overlapping speakers or sentences. The problem with this is that the dev and test sets will probably have to become a lot smaller, due to our limited sentence corpus.
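One way to redo the split so that no sentence can appear in more than one of train/dev/test is to assign whole sentence groups to a split, for example by hashing the sentence text. The sketch below illustrates this idea only; the function name and the split proportions are hypothetical, not Mozilla's actual procedure.

```python
import hashlib

def assign_split(sentence, dev_frac=0.05, test_frac=0.05):
    """Deterministically map a sentence to train/dev/test.

    Because the split is a pure function of the sentence text, every
    recording of the same sentence lands in the same split, so
    train/dev/test sentence overlap is impossible by construction.
    """
    h = int(hashlib.sha1(sentence.encode("utf-8")).hexdigest(), 16)
    r = (h % 10_000) / 10_000.0  # pseudo-uniform value in [0, 1)
    if r < test_frac:
        return "test"
    if r < test_frac + dev_frac:
        return "dev"
    return "train"

# All clips of one sentence always get the same split:
assert assign_split("Hello world.") == assign_split("Hello world.")
```

Speaker overlap can be prevented the same way by hashing a speaker ID instead of (or in addition to) the sentence text.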

Another approach is to modify our Common Voice server (i.e., this repo) to have special sentences that are quarantined to the test/dev sets, and then make certain users only get those sentences for reading. This is a better approach in the long term, since it means we could grow the test/dev sets larger and wouldn’t have to worry about throwing out any training data (again, due to our small corpus size).

We will be investigating the above approaches in the coming weeks, and will definitely fix this in the next release of the data (v2). In the meantime, any advice or information you (or anyone reading this) have regarding this problem would be very welcome.