As of now, is Deep Speech viable for real-world applications?

which is rather curious as the models are trained on <10sec audio clips

Have you investigated where the errors on short segments come from? For example:

  • Do the errors come from the acoustic model or the language model?
  • Do you see the same errors when you segment the audio manually rather than using webrtcVAD?
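
One way to make both comparisons concrete is to score the same clips under each condition and look at the word error rate side by side. Below is a minimal sketch, assuming you already have a reference transcript plus one hypothesis transcript per condition. The function names, condition labels, and sample strings are made up for illustration and are not part of Deep Speech or webrtcVAD.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def compare_conditions(reference: str, hypotheses: dict) -> dict:
    """Return the WER of each named condition against one reference."""
    return {name: word_error_rate(reference, hyp)
            for name, hyp in hypotheses.items()}


# Hypothetical transcripts for a single short clip (illustrative only).
reference = "turn on the light"
results = compare_conditions(reference, {
    "vad_with_lm": "turn the light",
    "vad_without_lm": "turn on the night",
    "manual_with_lm": "turn on the light",
})
for name, wer in sorted(results.items()):
    print(f"{name}: {wer:.2f}")
```

If the error rate drops with manual segmentation, the segment boundaries are the likely culprit. If it changes mostly when the language model is switched off, that points at the language model rather than the acoustic model.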