This mostly happens on longer sentences. Most likely, people hit the record button and then read the sentence silently before actually speaking it, which results in a long silence (around 10 sec) at the start of the clip.
People who are validating these clips will most likely think the recording is empty and hit the NO button.
Is it possible to pre-process the recordings to trim out the silent parts before pushing them to the listen queue? I saw a normalization request on GitHub too…
This is currently not possible and would be difficult to engineer: even basic splitting based on noise would be fairly processor-intensive and not necessarily reliable (e.g. it might have deleterious effects on clips that don’t contain silence). It’s not just a matter of fixing clips with silence, but also about potentially harming clips without it. It would also be difficult to tune properly.
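For a sense of what even the “basic” version involves, here’s a minimal Python sketch using librosa’s energy-based trim. To be clear, this is just my illustration, not anything from the Common Voice codebase, and the file paths and `top_db` threshold are made up. The whole tuning problem lives in that one parameter: too strict and noisy clips keep their “silence”, too loose and quiet speech at the edges gets chopped off.

```python
# Hypothetical sketch of energy-based silence trimming; not Common Voice code.
import librosa
import soundfile as sf

def trim_silence(in_path: str, out_path: str, top_db: float = 30.0) -> float:
    """Trim leading/trailing silence; return the number of seconds removed."""
    y, sr = librosa.load(in_path, sr=None)  # keep the original sample rate
    # Everything quieter than (peak - top_db) dB counts as silence. Picking
    # top_db is the tuning problem: quiet speakers can fall below it and
    # get their speech clipped instead of just the silence.
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    sf.write(out_path, y_trimmed, sr)
    return (len(y) - len(y_trimmed)) / sr
```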
This is one issue that discusses it, and this one too (for reference).
I think it’s worthwhile keeping the issue open, though. In principle these clips could be recovered after the fact by dataset engineers mining/processing the invalidated clips to find ones that were incorrectly marked as invalid.
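That mining step could look roughly like this. Again a hypothetical sketch: it assumes the `invalidated.tsv` layout from the released datasets (a tab-separated file with a `path` column), and the `top_db` and minimum-duration thresholds are arbitrary picks.

```python
# Hypothetical recovery pass over invalidated clips; thresholds are guesses.
import csv
import librosa

def flag_recoverable(invalidated_tsv: str, clips_dir: str,
                     top_db: float = 30.0, min_speech_s: float = 1.0) -> list:
    """Return paths of invalidated clips that still hold audio after trimming."""
    recoverable = []
    with open(invalidated_tsv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            y, sr = librosa.load(f"{clips_dir}/{row['path']}", sr=None)
            y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
            if len(y_trimmed) / sr >= min_speech_s:
                recoverable.append(row["path"])  # candidate for a second review
    return recoverable
```

Anything flagged this way would still need a human pass; the point is only to shrink the pile worth re-listening to.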
The blank clip detection in the links is also important. When a clip plays back as silent, I can never be sure whether it is actually empty or just failed to load over my current connection (low-bandwidth mobile data).
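A blank-clip check is also much cheaper and safer than trimming, since it never modifies the audio. A naive version might look like this (again just a sketch; the peak threshold is my guess, not the detector from the linked issues):

```python
# Naive blank-clip check; the threshold is an assumption for illustration.
import numpy as np
import librosa

def looks_blank(path: str, peak_threshold: float = 1e-3) -> bool:
    """True if the clip's peak amplitude never rises above the threshold.

    A single pop or click would defeat this; a real detector would look at
    energy over time rather than a single peak.
    """
    y, _ = librosa.load(path, sr=None)
    return float(np.max(np.abs(y))) < peak_threshold
```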
Just to clarify, it’s ok if the recording has a long silence on either end of it, right? I have encountered a few of these and gave them a thumbs down. But reading this thread, it seems to be ok if a recording has silent parts.
Apart from the possibility of rejection that the original post describes, there is only a minor side effect: it unnecessarily inflates the data (validated hours, file sizes) and the statistics (average seconds per sentence or per voice, etc.).