Identifying bad clips in dataset releases

I have a script that’s helping me identify bad clips in the Common Voice dataset that shouldn’t have been approved.

I’m building up a list at https://github.com/dabinat/bad-cv-clips so if anyone else has identified or filtered bad clips, feel free to submit a PR so everyone can benefit. I’m waiting for the list to grow a bit more before I submit it as an issue for the CV team to fix.

P.S. I have another script to make it really easy to filter these clips out of your CSVs.
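To give an idea of what that filtering could look like, here's a minimal pandas sketch. It is not the actual script; the file names and the shared "path" column are assumptions about the CSV/TSV layout.

```python
# Minimal sketch (not the actual filtering script): drop any row whose clip
# path appears in the bad-clips list. File names and the "path" column are
# assumptions about the CSV/TSV layout.
import pandas as pd

bad = pd.read_csv("bad_clips.csv")              # hypothetical bad-clip list
clips = pd.read_csv("validated.tsv", sep="\t")  # a Common Voice release file

filtered = clips[~clips["path"].isin(bad["path"])]
filtered.to_csv("validated_filtered.tsv", sep="\t", index=False)
```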


Is it possible to elaborate a bit more on the methodology used to detect “bad” clips?

Thanks!

It’s explained in more detail in the README, but the basic idea is that it looks for large differences between the number of expected words and the number of words it receives back from DeepSpeech.

So if it expects 5 words but receives 10 back, that could indicate that the user repeated the sentence or that someone else talked during the recording. If it expects 12 and receives 3, that could indicate a truncated recording, excessive background noise, or a recording that’s too quiet.
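Something like the following rough sketch illustrates the heuristic. It is not the author's script; the model file name, the threshold ratio, and the assumption that each clip has already been converted to 16 kHz mono WAV are all mine.

```python
# Rough illustration of the word-count heuristic (not the actual script):
# compare the expected word count with what DeepSpeech returns and flag
# clips where the two differ by more than a chosen ratio.
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.pbmm")  # assumed model file

def flag_clip(wav_path, expected_sentence, ratio=1.5):
    # Assumes the clip has been converted to 16 kHz, 16-bit mono WAV.
    with wave.open(wav_path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    transcript = model.stt(audio)
    expected = len(expected_sentence.split())
    received = len(transcript.split())
    # e.g. expecting 5 words but receiving 10, or expecting 12 and receiving 3
    return received > expected * ratio or received < expected / ratio
```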

It’s worth mentioning that the only automated process is identifying potential problem clips. I am still manually reviewing them before I put them in the bad-clip CSV. I am using the same criteria I would if I were validating through the website, and so far I have identified robotic/filtered voices, recordings that are too quiet, incorrect transcripts, truncated recordings, clipped audio, repetitions, and noise drowning out words. The CSV contains the original expected transcript to make it easy for others to check my work.


This is really interesting, and could potentially open the door to using DeepSpeech to flag these clips directly on the site.

I don’t know if it would be possible, but live feedback to the person recording would be amazing: “Oops, it seems your recording had some issues, please check again”

Maybe 20-25% of the clips it flags up are “bad”, with the rest generally being challenging accents. That could perhaps be reduced by tweaking the parameters, but the danger of using it on the site is that it would bias Common Voice toward the types of accents and noise environments that are already in the dataset. It would reduce diversity.

But if the goal is to reduce user recording errors, a mic test feature that lets the user check that their mic is set up and the volume is OK would go a long way.
