Hi everyone!
I’ve recently been working with the Kinyarwanda dataset and I noticed that a lot of the files in the training set are “broken” in one of two ways:
- No speech recorded in the audio segment, just some background noise.
- Strange sample rate distortion. It seems to be exclusive to files with a sample rate of 32000. The speech sounds like it has been sped up, but slowing it down doesn’t help.
I understand that these errors slipped through because of the resources it takes to validate such a huge dataset. I’m posting this to let researchers know that you’ll need to clean the dataset before training networks on it.
During EDA I came up with a heuristic to clean the files for now. Later I will try to do that with a trained network.
If we plot the number of characters in the phrase against the audio duration (here I’ve also set constraints: duration < 10, number of characters < 200), we can roughly clean out all the outliers. In the plot the color is the sample rate: yellow is 32000, blue is 48000. The blue line is the heuristic function 50*sqrt(duration-1)+10.
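
For anyone who wants to try this, here is a rough sketch of how the heuristic could be applied in Python. Only the curve 50*sqrt(duration-1)+10 and the two limits (duration < 10, characters < 200) come from the post. Everything else is an assumption made for illustration: the function names are hypothetical, duration is taken to be in seconds, clips shorter than 1 second are handled by clamping the square root to zero, clips outside the two limits are flagged, and the side of the curve that counts as an outlier (more characters than the curve allows) is a guess, so flip the comparison if the plot for your data shows the opposite.

```python
import math


def heuristic_chars(duration_s: float) -> float:
    """Curve from the post: 50*sqrt(duration-1)+10.

    Assumption: for clips shorter than 1 s the square root is clamped
    to 0, since the original formula is undefined there.
    """
    return 50.0 * math.sqrt(max(duration_s - 1.0, 0.0)) + 10.0


def is_outlier(duration_s: float, n_chars: int,
               max_duration: float = 10.0, max_chars: int = 200) -> bool:
    """Flag a clip as a probable outlier.

    Assumptions (not stated in the post):
    - clips outside the plotted limits are flagged as well;
    - a clip is an outlier when its transcript has MORE characters
      than the curve allows for its duration.
    """
    if duration_s >= max_duration or n_chars >= max_chars:
        return True
    return n_chars > heuristic_chars(duration_s)


# Made-up (duration in seconds, number of characters) pairs for illustration.
clips = [(5.0, 60), (2.0, 120), (12.0, 50)]
kept = [clip for clip in clips if not is_outlier(*clip)]
print(kept)
```

The comparison lives in one place on purpose, so it is easy to invert or to swap the curve for a different one after looking at your own scatter plot.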