Thanks for making Common Voice! I’m a speech researcher and I want to use it for my experiments. Unfortunately, there is a big problem with how the corpus (v1), and specifically the train/dev/test split, is designed. This issue is also being discussed here: https://github.com/kaldi-asr/kaldi/issues/2141 (Commonvoice results misleading, complete overlap of train/dev/test sentences #2141)
There is a huge overlap of sentences between train, dev, and test (counts below; a script to reproduce them follows the list):
unique sentences in train: 6994
unique sentences in dev: 2410
unique sentences in test: 2362
common sentences train/dev (overlap) = 2401
common sentences train/test (overlap) = 2355
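For reference, here is a minimal sketch of how I computed these counts. It assumes the v1 CSV files cv-valid-train.csv / cv-valid-dev.csv / cv-valid-test.csv with a `text` column; adjust the paths and column name if your copy of the corpus is laid out differently.

```python
import csv

def sentences(path):
    """Return the set of normalized sentences in a Common Voice CSV."""
    with open(path, newline='', encoding='utf-8') as f:
        # NOTE: the 'text' column name is an assumption based on the v1 CSV
        # layout; change it if your files use a different header.
        return {row['text'].strip().lower() for row in csv.DictReader(f)}

train = sentences('cv-valid-train.csv')
dev   = sentences('cv-valid-dev.csv')
test  = sentences('cv-valid-test.csv')

print('unique sentences in train:', len(train))
print('unique sentences in dev:  ', len(dev))
print('unique sentences in test: ', len(test))
print('common sentences train/dev: ', len(train & dev))
print('common sentences train/test:', len(train & test))
```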
There should ideally be no overlap at all. Also, there are not that many unique sentences, which makes the problem worse. I suggest recording some new sentences and building new dev and test sets with no overlap.
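Even before new recordings are available, the existing clips could at least be re-split so that no sentence is shared between sets (this would not fix the small number of unique sentences, but it would remove the overlap). A rough sketch, assuming the same CSV layout as above and arbitrary example split ratios:

```python
import csv
import random
from collections import defaultdict

FILES = ['cv-valid-train.csv', 'cv-valid-dev.csv', 'cv-valid-test.csv']

# Group every clip by its normalized sentence so that each sentence can
# only ever end up in one split.
by_sentence = defaultdict(list)
for path in FILES:
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            by_sentence[row['text'].strip().lower()].append(row)

sents = sorted(by_sentence)
random.Random(0).shuffle(sents)          # fixed seed for reproducibility

n = len(sents)
splits = {
    'train': sents[:int(0.8 * n)],       # 80/10/10 is just an example ratio
    'dev':   sents[int(0.8 * n):int(0.9 * n)],
    'test':  sents[int(0.9 * n):],
}

for name, sent_list in splits.items():
    clips = [row for s in sent_list for row in by_sentence[s]]
    print(f'{name}: {len(sent_list)} unique sentences, {len(clips)} clips')
```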
Otherwise, any WER numbers reported on this train/dev/test split are unfortunately pretty much meaningless and encourage outright overfitting to the training data. Moreover, the best results are obtained when the language model is trained only on the train sentences, without any other text. This is essentially what the Kaldi recipe does now (https://github.com/kaldi-asr/kaldi/blob/master/egs/commonvoice), and it encourages recognizing only these ~7000 sentences and nothing else. The reported 4% WER highlights the problem. This is surely not the intention of the Common Voice corpus.
Let me know if this is the right place to address these concerns.