Long silence in some recordings... Solution: pre-trimming?

This mostly happens on longer sentences. Probably, people hit the record button and then read the sentence silently before actually speaking it. This results in a long silence (around 10 seconds) at the start of the clip.

People who are validating these clips would most likely think that the recording is empty and hit the NO button.

Is it possible to pre-process the recordings to trim out the silent parts before pushing them to the listen queue? I saw a normalization request on GitHub too…

This is currently not possible and would be difficult to engineer. Even basic splitting based on noise level would be fairly processor intensive and not necessarily reliable. It’s not just a matter of fixing clips with silence: trimming could also have deleterious effects on clips that don’t contain any silence. It would also be difficult to tune properly.
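To make the tuning problem concrete, here is a minimal sketch of what basic energy-based trimming could look like. It is not anything Common Voice runs. The function names, the fixed `threshold`, the 20 ms frame size and the 100 ms padding are all assumptions picked for illustration, and the samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def trim_silence(samples, sample_rate, threshold=0.01, frame_ms=20, pad_ms=100):
    """Cut quiet audio from both ends of a clip (hypothetical helper).

    A frame counts as "loud" when its RMS energy reaches `threshold`.
    Everything before the first loud frame and after the last one is
    dropped, except for `pad_ms` of padding kept on each side.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    loud_starts = [
        start
        for start in range(0, len(samples), frame_len)
        if frame_rms(samples[start:start + frame_len]) >= threshold
    ]
    if not loud_starts:
        # Nothing reached the threshold: leave the clip untouched
        # rather than delete all of it.
        return samples
    pad = sample_rate * pad_ms // 1000
    begin = max(0, loud_starts[0] - pad)
    end = min(len(samples), loud_starts[-1] + frame_len + pad)
    return samples[begin:end]
```

A single fixed threshold is exactly where this gets unreliable: set it too high and quiet speakers or soft word endings get cut, set it too low and background noise counts as speech, so nothing is trimmed.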

This is one issue that discusses it and this one too (for reference).

I think it’s worthwhile keeping the issue open though. In principle these clips could be recovered by dataset engineers after the fact by mining/processing the invalidated clips to find ones that are incorrectly marked as invalid.
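As a rough sketch of that kind of after-the-fact mining, something like the following could flag invalidated clips that start with a long silence, so they can be listened to again. The function names, the `clips` mapping from clip id to samples, and the 3-second cutoff are hypothetical, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def leading_silence_seconds(samples, sample_rate, threshold=0.01, frame_ms=20):
    """Seconds of quiet audio before the first loud frame, or None if
    no frame reaches the threshold (the clip may really be blank)."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            return start / sample_rate
    return None


def recovery_candidates(clips, sample_rate, min_silence_s=3.0):
    """Return ids of clips that contain sound but begin with at least
    `min_silence_s` seconds of silence.

    `clips` is a hypothetical dict: clip id -> list of float samples.
    """
    candidates = []
    for clip_id, samples in clips.items():
        silence = leading_silence_seconds(samples, sample_rate)
        if silence is not None and silence >= min_silence_s:
            candidates.append(clip_id)
    return candidates
```

This only narrows down what to re-check. A flagged clip still needs a human (or some other check) to confirm the sentence was actually read correctly.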

Thank you for the info and pointers :slight_smile:

The blank clip detection mentioned in the links is also important. When a clip sounds empty, I can never be sure whether it really is blank or whether it is just my current connection (low-bandwidth mobile data).

Maybe this request would “convince” contributors to re-record after hearing their own clip(s):

https://github.com/common-voice/common-voice/issues/3304

Just to clarify, it’s OK if the recording has a long silence at either end, right? I have encountered a few of these and gave them a thumbs down. But reading this thread, it seems to be OK if a recording has silent parts.

Apart from the possibility of rejection, as the original post suggests, there is only a minor side effect: it unnecessarily inflates the data (validated hours, file sizes) and the statistics (average seconds per sentence or per voice, etc.).

So they are OK as long as they meet the other criteria…
