As of now, is DeepSpeech viable for real-world applications?

By acoustic vs. language model, do you mean trying transcription without the language model? If so, we did that accidentally in older versions of DeepSpeech Frontend, and the results were poor. That was before we added VAD, though.
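For anyone who wants to reproduce that comparison on a current release, here's a minimal sketch using the DeepSpeech 0.9 Python API. The model/scorer file names and the WAV path are placeholders, and the older Frontend predates this API, so treat it as illustrative only:

```python
# pip install deepspeech numpy  (API shape is from DeepSpeech 0.7+)
import wave
import numpy as np
from deepspeech import Model

MODEL_PATH = 'deepspeech-0.9.3-models.pbmm'      # placeholder paths
SCORER_PATH = 'deepspeech-0.9.3-models.scorer'

with wave.open('utterance.wav', 'rb') as w:      # expects 16 kHz, 16-bit mono
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

model = Model(MODEL_PATH)
print('acoustic model only:', model.stt(audio))  # beam search, no LM scoring

model.enableExternalScorer(SCORER_PATH)          # enable the KenLM scorer
print('with language model:', model.stt(audio))
```

The first `stt` call decodes with the acoustic model alone; enabling the scorer is what brings the language model into the beam search.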

Manually breaking apart the audio doesn't produce these incorrect transcriptions. My working theory (corroborated by others on the #machinelearning IRC channel) is that WebRTCVAD is a touch aggressive in slicing up audio, so the beginning and end of a word can sometimes get partially chopped off, and DeepSpeech then can't make sense of the malformed audio it's given.
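If that theory is right, one mitigation is to pad each VAD segment with a little surrounding audio before handing it to DeepSpeech. Here's a rough sketch with py-webrtcvad, assuming 16 kHz, 16-bit mono PCM; the padding scheme and parameter values are my own guesses, not what the frontend actually does:

```python
import collections
import webrtcvad

SAMPLE_RATE = 16000                      # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30                            # frames must be 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono samples

def padded_segments(pcm, aggressiveness=1, pad_frames=8):
    """Yield speech segments from raw PCM, keeping pad_frames of context
    on both sides so word onsets/offsets aren't clipped by the VAD."""
    vad = webrtcvad.Vad(aggressiveness)  # 0 = least aggressive, 3 = most
    frames = [pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
    ring = collections.deque(maxlen=pad_frames)  # recent non-speech frames
    voiced, tail = [], 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            if not voiced:
                voiced.extend(ring)      # prepend leading context
                ring.clear()
            voiced.append(frame)
            tail = 0
        elif voiced and tail < pad_frames:
            voiced.append(frame)         # keep a short tail after speech
            tail += 1
        else:
            if voiced:
                yield b''.join(voiced)
                voiced, tail = [], 0
            ring.append(frame)
    if voiced:
        yield b''.join(voiced)
```

Lowering the aggressiveness and widening `pad_frames` trade segment purity against the risk of clipping word boundaries, so both are worth tuning per deployment.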