Sub-par data uses

From my understanding, Common Voice is supposed to be an ideal reading of known text (not ideal in audio fidelity, but in the accuracy (and enunciation?) of the words read).

With this in mind, is there any chance we could get the “sub-par” data as a separate set in the future?

The benefit, to me, is help with edge cases of how people actually say words: either to generate purposely “flawed” TTS, or to improve real-world STT (where people may say a word “wrong” a lot).


CV isn’t supposed to be recorded in ideal conditions, but in real conditions, often with background noise. So the quality of ‘good’ recordings varies hugely, and includes many where a particular word is pronounced in a variety of ways, including some that others might consider to be wrong.

A recording is defined as ‘good’ if it’s approved by at least two validation volunteers who have listened to the audio. Since the votes are noted it should be possible to extract ‘bad’ recordings by pulling out those where both voted to decline. Sub-par or edge cases would be where the two volunteers disagreed, and a third has used a casting vote to approve. That would create a dataset of ‘good’ recordings that at least one volunteer out of three believes to be bad.
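The bucketing described above can be sketched from the per-clip vote counts. This is a minimal sketch, assuming rows with integer `up_votes` and `down_votes` fields like those in the TSV files shipped with Common Voice releases; the exact thresholds are my interpretation of the voting rules, not an official definition:

```python
from collections import defaultdict

def partition_by_votes(rows):
    """Bucket clips by validation votes.

    Assumes each row has 'up_votes' and 'down_votes' fields,
    as in the per-clip TSVs of a Common Voice release.
    """
    buckets = defaultdict(list)
    for row in rows:
        up, down = int(row["up_votes"]), int(row["down_votes"])
        if up >= 2 and down == 0:
            buckets["good"].append(row)       # unanimous approval
        elif down >= 2 and up == 0:
            buckets["bad"].append(row)        # unanimous decline
        elif up >= 2 and down >= 1:
            buckets["edge"].append(row)       # approved on a casting vote
        else:
            buckets["undecided"].append(row)  # not enough votes yet
    return buckets
```

You could feed it rows from `csv.DictReader(open("validated.tsv"), delimiter="\t")`; the `edge` bucket would then be the “good but contested” set described above.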

I will qualify my “good” a little further.

But yes, is there a dataset of rejected clips?

Yes, rejected clips (those with two ‘decline’ votes) are available. See here: Rejected audio dataset


Users appear not to know about the Skip button for some reason, so some just record silence in order to get to the next clip. The invalid set may therefore be a good source of background noise / room tone for anyone willing to delve in and extract that data :slight_smile:
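Picking those silent clips out of the invalid set could be done with a simple energy check. A minimal sketch, assuming the clips have already been decoded to 16-bit PCM samples (Common Voice ships MP3s, so decode first, e.g. with ffmpeg); the RMS threshold here is a hypothetical starting value to tune on your own data:

```python
import math

def is_near_silence(samples, rms_threshold=100.0):
    """Return True if a sequence of 16-bit PCM samples is close to silence.

    rms_threshold is an assumed cutoff (16-bit samples span -32768..32767);
    tune it for your recordings and noise floor.
    """
    if not samples:
        return True
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < rms_threshold
```

Running this over the invalidated clips would separate the “recorded nothing” submissions (usable as room tone) from clips that were declined for other reasons.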


Hi! I’m working on a way to help everyone with pronunciation remediation for free. The best way to do so is to produce a pull request changing the binary thumbs-up-good/thumbs-down-bad feedback for likely exemplary pronunciations into transcription requests for subpar pronunciations. Please join the Spoken Language Interest Group of the IEEE Learning Technologies Standards Committee at and follow the main Discourse topic at: Intelligibility remediation

Thank you!

Michael, do you know the proportion of declined versus accepted recordings overall?

@jsalsman, I don’t know the exact proportion, though it would be possible to check. My best guess based on my experience of validating the recordings is that the error rate is perhaps 20% overall, rising to 30% to 40% for the recent Wikipedia sentences. There’s some further discussion in the thread: About the new English Sentences.
