This mostly happens with longer sentences. People probably hit the record button and then read the sentence silently before actually speaking it. This results in a long silence (around 10 seconds) at the start of the clip.
People validating these clips will most likely think the recording is empty and hit the NO button.
Is it possible to pre-process the recordings to trim out the silent parts before pushing them to the listen queue? I saw a normalization request on GitHub too…
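To make the idea concrete, here is a rough sketch of the kind of trimming I mean. It is not how the project processes clips today, just an illustration: the function names, the RMS threshold, the 20 ms frame size and the 100 ms padding are all my own assumptions, and it works on a plain list of float samples in the range -1.0 to 1.0 rather than on a real audio file.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def trim_silence(samples, sample_rate, threshold=0.01, frame_ms=20, pad_ms=100):
    """Drop leading and trailing silence from a mono clip.

    All default values are illustrative guesses, not tuned settings.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    pad = sample_rate * pad_ms // 1000

    # Start offsets of every frame whose level reaches the threshold.
    loud = [
        start
        for start in range(0, len(samples), frame_len)
        if frame_rms(samples[start:start + frame_len]) >= threshold
    ]
    if not loud:
        return []  # the whole clip is below the threshold

    # Keep a little padding so the first and last words are not clipped.
    begin = max(0, loud[0] - pad)
    end = min(len(samples), loud[-1] + frame_len + pad)
    return samples[begin:end]
```

A clip that comes back empty could be flagged instead of being sent to the listen queue at all. A fixed threshold will probably misfire on noisy microphones, so a real version would likely need to measure the noise floor per clip first.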