which is rather curious, as the models are trained on audio clips shorter than 10 seconds.
Have you investigated where the errors on short segments come from? For example:
- do they come from the acoustic model or the language model?
- can you see the same errors when segmenting the audio manually rather than using webrtcVAD?
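
For the manual-segmentation check, something along these lines might be enough. It is only a rough sketch using Python's standard `wave` module; the function name `cut_segments`, the output naming scheme, and the idea of passing hand-picked (start, end) times in seconds are my assumptions, not part of any existing tooling here.

```python
import wave


def cut_segments(wav_path, boundaries_sec, out_prefix):
    """Write one WAV file per (start, end) pair in seconds; return the new paths.

    Hypothetical helper: the boundaries are chosen by hand (e.g. in an audio
    editor) so the resulting clips can be compared with the VAD-produced ones.
    """
    out_paths = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(boundaries_sec):
            first = int(round(start * rate))          # first frame of the clip
            count = int(round(end * rate)) - first    # number of frames to copy
            src.setpos(first)
            frames = src.readframes(count)
            out_path = f"{out_prefix}_{i:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)   # same channels, sample width, rate
                dst.writeframes(frames)  # header frame count is fixed on close
            out_paths.append(out_path)
    return out_paths
```

Running the same model on these hand-cut clips and on the VAD-cut clips should show whether the short-segment errors follow the segmentation or the model itself.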