Long silence in some recordings... Solution pre-trimming?

bozden · September 19, 2021, 3:24pm

This mostly happens with longer sentences. People probably hit the record button and then read the sentence silently before actually speaking it. This results in a long silence (around 10 seconds) at the start of the clip.

People validating these clips would most likely think that the recording is empty and hit the NO button.

Is it possible to pre-process the recordings to trim out the silent parts before pushing them to the listen queue? I saw a normalization request on GitHub too…

ftyers · September 20, 2021, 4:03am

This is currently not possible and would be difficult to engineer. Even basic splitting based on noise would be fairly processor-intensive and not necessarily reliable (e.g. it might have deleterious effects on clips that don’t contain silence). It’s not just a matter of fixing clips with silence; it’s also about not harming clips without it. It would also be difficult to tune properly.
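To make the tuning problem concrete, here is a minimal sketch of what "basic splitting based on noise" could look like. It is not how Common Voice processes clips. It assumes the audio is already decoded into a list of floats in the range -1.0 to 1.0, and the names `frame_rms` and `trim_silence`, the 20 ms frame size, and the 0.01 threshold are all hypothetical choices for illustration.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each consecutive, non-overlapping frame."""
    rms = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    return rms


def trim_silence(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Drop leading and trailing frames whose RMS is below `threshold`.

    `threshold` is an assumed value: too high and quiet speech is cut,
    too low and background noise counts as speech.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    rms = frame_rms(samples, frame_len)
    voiced = [i for i, energy in enumerate(rms) if energy >= threshold]
    if not voiced:
        return []  # every frame is below the threshold
    start = voiced[0] * frame_len
    end = min(len(samples), (voiced[-1] + 1) * frame_len)
    return samples[start:end]


# Synthetic clip: 100 silent samples, 60 "speech" samples, 40 silent samples.
clip = [0.0] * 100 + [0.5] * 60 + [0.0] * 40
trimmed = trim_silence(clip, sample_rate=1000)
```

A single fixed threshold like this is exactly where the reliability concern shows up: a quiet speaker or a soft first syllable can fall below it and be cut from a clip that was fine to begin with.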

This is one issue that discusses it and this one too (for reference).

I think it’s worthwhile keeping the issue open though. In principle these clips could be recovered by dataset engineers after the fact by mining/processing the invalidated clips to find ones that are incorrectly marked as invalid.
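As a rough illustration of that kind of after-the-fact mining, the sketch below flags invalidated clips that start with a long silence but do contain sound later on. It assumes each clip is already decoded into floats between -1.0 and 1.0 and keyed by a clip ID in a plain dict, which is not the actual Common Voice dataset layout. The function names and the threshold values are hypothetical, and anything flagged would still need a human to listen to it.

```python
def leading_silence_seconds(samples, sample_rate, threshold=0.02, frame_ms=20):
    """Seconds before the first frame whose peak amplitude reaches `threshold`.

    Returns None when no frame reaches the threshold (the clip looks empty).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) >= threshold:
            return start / sample_rate
    return None


def recovery_candidates(invalidated_clips, sample_rate, min_silence_s=3.0):
    """List (clip_id, leading_silence) for clips worth a second listen."""
    candidates = []
    for clip_id, samples in invalidated_clips.items():
        lead = leading_silence_seconds(samples, sample_rate)
        # Skip clips that look empty and clips that start speaking promptly.
        if lead is not None and lead >= min_silence_s:
            candidates.append((clip_id, lead))
    return candidates


# Synthetic data at 100 samples per second, IDs are made up.
clips = {
    "a": [0.0] * 500 + [0.3] * 100,  # 5 s of silence, then sound
    "b": [0.3] * 100,                # sound from the start
    "c": [0.0] * 300,                # silent throughout
}
found = recovery_candidates(clips, sample_rate=100)
```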

bozden · September 20, 2021, 8:01am

Thank you for the info and pointers.

The blank clip detection discussed in the links is also important. When a clip sounds blank, I can never be sure whether it really is blank or whether it is just my current connection (low-bandwidth mobile data).

robovoice · October 22, 2021, 8:18pm

Maybe this request would “convince” contributors to re-record after hearing their own clip(s).

https://github.com/common-voice/common-voice/issues/3304

Razmik-Badalyan · October 2, 2022, 8:13am

Just to clarify, it’s OK if the recording has a long silence on either end of it, right? I have encountered a few of these and gave them a thumbs down. But reading this thread, it seems to be OK if a recording has silent parts.

bozden · October 2, 2022, 12:10pm

Apart from the possibility of rejection that the original post describes, there is only a minor side effect: unnecessarily inflating the data (validated hours, file sizes) and the statistics (average seconds per sentence or per voice, etc.).

So, they are OK if they meet the other criteria…