As of now, is Deep Speech viable for real-world applications?

Yeah, one way to check the results of the acoustic model in greater detail is to look at the output of the softmax layer and see which characters were ranked as most probable at each time step.

E.g., is the expected character among the 5 most probable characters guessed by the model?

If all the expected characters are there after the acoustic phase, then the language model part is the culprit.
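A rough sketch of that top-5 check, in case it helps. It assumes you have already dumped the softmax output as one row of probabilities per time step; how you get that out of the graph depends on your setup and is not shown here. The names `top_k_indices`, `missing_characters`, `frames` and `alphabet` are made up for this example and are not part of the Deep Speech API. Since the softmax output is not aligned to the transcript, the sketch walks the frames left to right and greedily looks for each expected character in the top-k of some later frame, which is only a heuristic.

```python
from typing import List, Sequence


def top_k_indices(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest probabilities in a single frame."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def missing_characters(
    frames: Sequence[Sequence[float]],
    alphabet: Sequence[str],
    expected: str,
    k: int = 5,
) -> List[str]:
    """Return the expected characters that never show up in the top-k.

    frames   -- hypothetical softmax dump, one row per time step
    alphabet -- label for each column of a row (including the blank)
    expected -- reference transcript

    Greedy left-to-right match: each expected character must appear in
    the top-k of a frame later than the one used for the previous match.
    """
    labels = list(alphabet)
    missing: List[str] = []
    t = 0  # first frame still available for matching
    for ch in expected:
        if ch not in labels:
            missing.append(ch)
            continue
        target = labels.index(ch)
        found = None
        for u in range(t, len(frames)):
            if target in top_k_indices(frames[u], k):
                found = u
                break
        if found is None:
            missing.append(ch)
        else:
            t = found + 1
    return missing


# Toy data: 4 time steps over the labels a, b, c and a blank "_".
toy_alphabet = ["a", "b", "c", "_"]
toy_frames = [
    [0.70, 0.15, 0.10, 0.05],
    [0.05, 0.10, 0.15, 0.70],
    [0.10, 0.60, 0.25, 0.05],
    [0.05, 0.15, 0.10, 0.70],
]
print(missing_characters(toy_frames, toy_alphabet, "ac", k=1))
print(missing_characters(toy_frames, toy_alphabet, "ac", k=2))
```

An empty result for a badly transcribed clip would point at the language model; a non-empty one means the acoustic model never ranked those characters highly, so the problem starts earlier.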

Even with VAD cutting off the first/last word, the middle part should still be transcribed reasonably well, unless the bad start/end of the word sequence throws the language model off completely.