The datasets are useless

The problem with this dataset is that the audio is not clean; there is a lot of noise and other artifacts that make the model overfit. In fact, LibriSpeech works better because its recordings are cleaned.

Hey Bernardo,

Thanks for joining the Common Voice Discourse and for your feedback regarding the dataset.

I wanted to highlight that Common Voice's validation guidelines encourage voice recordings made in real-world environments, so that TTS can be trained to understand how real people speak - but also within boundaries that support the vitality of the dataset.

Could you explain more about the noise you encountered, and share more details about the model overfitting you experienced?


@Bernardo_olisan: Shouldn’t it be the other way around? If you train your model only on clean data, you will get overfitting, i.e. it will not generalize enough and will fail on real-world data. Therefore you usually augment your dataset during training to get a better model.

Think of the Common Voice dataset as already somewhat augmented - with all the different device qualities, background noises, etc. Therefore, during training, you may want to augment it less than usual. (BTW, it is not for TTS, it is for STT.)
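For example, here is a minimal sketch of what "augmenting less" could look like in practice - injecting white noise into a training waveform only with a reduced probability (plain NumPy; the probability and SNR values are purely illustrative, not a recommendation):

```python
import numpy as np

def augment_waveform(waveform: np.ndarray, noise_prob: float = 0.2, snr_db: float = 15.0) -> np.ndarray:
    """Randomly add white noise to a waveform with probability `noise_prob`.

    For an already-noisy corpus like Common Voice a lower `noise_prob`
    may be enough; a clean corpus like LibriSpeech may need more.
    """
    if np.random.rand() > noise_prob:
        return waveform  # leave most samples untouched
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise
```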

If a recording does not match the text, that is a problem of course. But different pronunciations of the same words by different people (accents, etc.) are beneficial. The validation criteria mentioned above may seem relaxed, but they are important. The more people volunteer, the larger the text corpus and the more recordings, the better the models will perform, as usual…

Another problem with LibriSpeech and similar datasets is that they are provided in only a limited set of languages. Common Voice supports any language and currently has datasets for 84+ languages.

Try training with the Common Voice dataset and testing with LibriSpeech test-clean & test-other, and then the other way around…
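As a rough sketch of such a cross-test, assuming the Hugging Face `datasets` library: the LibriSpeech test sets are on the Hub as `librispeech_asr` with "clean" and "other" configurations, and the Common Voice identifier/version below is only an example (recent releases require accepting the license on the Hub first, and exact names may change):

```python
from datasets import load_dataset

# LibriSpeech test sets: the "clean" and "other" configurations
ls_test_clean = load_dataset("librispeech_asr", "clean", split="test")
ls_test_other = load_dataset("librispeech_asr", "other", split="test")

# A Common Voice release for English - check the exact name/version on the Hub
cv_test = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test")

# Then run the model trained on one corpus over the other corpus's test set
# (and vice versa) and compare the resulting word error rates.
```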

Welcome to Common Voice btw :slight_smile:


Some time ago I built an ASR model using a transformer neural network. I trained it on Common Voice and the predictions were horrible; although my loss was under 0.5, they were still horrible. I listened to the audio and realized the recordings were very uneven, not clean. In the end I used LibriSpeech clean, and although my loss was 0.7, the predictions were very good.

I’m afraid it will be nearly impossible for anyone to comment on this without further information.

In any case, this might be a good read: https://arxiv.org/pdf/2010.11745.pdf
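On the loss numbers above: raw loss values are hard to compare across datasets, so word error rate (WER) on held-out audio is the usual way to quantify how good or "horrible" the predictions are. A minimal sketch with the jiwer package - `transcribe` here is a hypothetical placeholder for whatever decoding step your model uses:

```python
import jiwer

def evaluate_wer(model, test_samples, transcribe):
    """Compute the word error rate of `model` over (audio, reference_text) pairs.

    `transcribe(model, audio)` is a placeholder for your model's decoding step.
    """
    references, hypotheses = [], []
    for audio, reference_text in test_samples:
        references.append(reference_text)
        hypotheses.append(transcribe(model, audio))
    return jiwer.wer(references, hypotheses)
```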
