Rejected audio dataset

nmstoker · April 3, 2019, 8:10pm

Is there a plan to release rejected audio for sentences at all?

I could have imagined it, but I’m sure I saw that idea discussed (although I can’t find it now).

This would be of interest for a few learning scenarios. My main one is testing some automatic sentence checking: I have “known good” examples, but it would be ideal to use some genuine “known bad” cases. I can create fake bad cases (e.g. corrupting the audio or swapping the audio with a different sentence text), but I thought it best to ensure I had a representative sample.
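As a rough illustration of the "swap the audio with a different sentence text" idea, here is a minimal Python sketch. The function name `make_mismatched_pairs` and the `(audio_path, sentence)` tuple layout are assumptions made for this example, not part of any Common Voice tooling.

```python
import random


def make_mismatched_pairs(pairs, seed=0):
    """Build synthetic "known bad" cases from "known good" ones.

    `pairs` is a list of (audio_path, sentence) tuples (a hypothetical
    layout). Each clip is paired with the sentence of another clip.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    bad = []
    n = len(shuffled)
    for i, (audio, sentence) in enumerate(shuffled):
        # Borrow the sentence of the next clip in shuffled order (wrapping).
        _, other_sentence = shuffled[(i + 1) % n]
        # Skip the pair if the borrowed text happens to be identical.
        if other_sentence != sentence:
            bad.append((audio, other_sentence))
    return bad
```

Cases built this way only cover one kind of failure (wrong text for the clip), which is why a sample of genuinely rejected recordings is still useful.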

When I get a bit further, I’ll put together a thread on the checking process, but the basic idea is to take one (or preferably a couple of) “known good” recordings of a sentence, run them through a syllable count of the audio and a few other things, and then compare an unknown sample against that. The trick is to make it flexible enough to capture variations in how things are said, whilst still distinguishing audio that’s not correct. There are some Python tools based on Praat that seem helpful.
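The comparison step might look something like the sketch below. It assumes the syllable counts have already been estimated by some other tool, and the function name `is_plausible` and the 20% tolerance are illustrative choices rather than anything described in the thread.

```python
def is_plausible(reference_counts, candidate_count, tolerance=0.2):
    """Compare an unknown clip's syllable count to "known good" counts.

    `reference_counts` holds syllable counts measured on known-good
    recordings of the same sentence. `tolerance` is a hypothetical
    fraction of the reference mean that a candidate may deviate by.
    """
    if not reference_counts:
        raise ValueError("need at least one known-good reference count")

    mean = sum(reference_counts) / len(reference_counts)
    # Always allow a deviation of one syllable, so short sentences are
    # not rejected over a single miscounted syllable.
    allowed = max(1.0, tolerance * mean)
    return abs(candidate_count - mean) <= allowed


# Two known-good readings gave 10 and 11 syllables.
print(is_plausible([10, 11], 10))  # within tolerance
print(is_plausible([10, 11], 4))   # far too short
```

The tolerance controls the trade-off mentioned above: a wider band accepts more natural variation in how a sentence is spoken, but lets more incorrect audio through.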

kdavis · April 4, 2019, 10:05am

@nmstoker Rejected audio is released. Look at the invalidated.tsv file in the release data.
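For reference, a minimal sketch of reading that file with Python's standard library. It assumes the file is tab-separated with a header row that includes `path` and `sentence` columns; check the header of your release, since the exact columns may differ. The sample rows are invented for illustration.

```python
import csv
import io


def read_invalidated(tsv_file):
    """Yield (clip path, sentence) pairs from an open invalidated.tsv."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["path"], row["sentence"]


# In-memory stand-in for the real file; the rows are made up.
sample = io.StringIO(
    "client_id\tpath\tsentence\tup_votes\tdown_votes\n"
    "abc\tclip_0001.mp3\tThe cat sat on the mat.\t0\t2\n"
)
rejected = list(read_invalidated(sample))
print(rejected)
```

With a real release, the same function can be called on `open("invalidated.tsv", newline="", encoding="utf-8")`, adjusting the path to wherever the dataset was extracted.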

nmstoker · April 5, 2019, 12:23am

Ah, didn’t see that. Thank you.

