Kinyarwanda broken files

Hi everyone!
I’ve recently been working with the Kinyarwanda dataset, and I noticed that a lot of the files in the training set are “broken” in two ways:

  1. No speech recorded in the audio segment, just some background noise.
  2. Strange sample rate distortion. It seems to be exclusive to files with a sample rate of 32000. The speech sounds like it has been sped up, but slowing it down doesn’t help.

I understand that these errors slip through because of the resources needed to validate such a huge dataset. I’m posting this to let researchers know that you’ll need to clean the dataset first before training networks on it.

During EDA I came up with a heuristic to clean the files for now; later I will try to do that with a trained network.

If we plot audio duration against the number of characters in the phrase (here I’ve also set constraints: duration < 10, number of characters < 200), we can roughly clean out all the outliers. In the plot, the color is the sample rate: yellow is 32000, blue is 48000. The blue line is the heuristic function 50*sqrt(duration-1)+10.
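
For anyone who wants to try it, here is a minimal Python sketch of that curve. The function names (`char_threshold`, `above_curve`) are made up for illustration, and clamping durations below 1 to the curve's starting value is my assumption, since sqrt(duration-1) is undefined there. The sketch only reports which side of the curve a file falls on; check against the plot for your own data before deciding which side to drop.

```python
import math


def char_threshold(duration: float) -> float:
    """Heuristic curve from the post: 50*sqrt(duration-1)+10.

    Assumption: for duration < 1 the square root is undefined,
    so the argument is clamped to 0 and the curve returns 10.
    """
    return 50.0 * math.sqrt(max(duration - 1.0, 0.0)) + 10.0


def above_curve(duration: float, n_chars: int) -> bool:
    """True if a clip has more characters than the curve gives for its duration."""
    return n_chars > char_threshold(duration)


# Hypothetical (duration, number of characters) pairs, not real dataset rows.
examples = [(5.0, 120), (5.0, 100), (0.5, 30)]
for duration, n_chars in examples:
    side = "above" if above_curve(duration, n_chars) else "on or below"
    print(f"duration={duration}, chars={n_chars}: {side} the curve")
```

The same comparison can be applied row by row to a table of durations and transcript lengths after applying the duration < 10 and number of characters < 200 constraints.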