As of now, is Deep Speech viable for real-world applications?

Shouldn’t adding a small time buffer before and after each segment that webrtcvad outputs solve the problem?

For example, if VAD says the voice lies between 4.20 sec (start) and 6.80 sec (end), we can cut the chunk from 4.18 sec to 6.82 sec,
i.e. a 20 ms buffer before the start time and after the end time.

The only problem here would be to choose the exact buffer time to use.
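
To make the idea concrete, here is a rough sketch of the padding step. The function names (`pad_segment`, `slice_pcm`) and the default values are made up for illustration, and it assumes the audio is raw 16-bit mono PCM at 16 kHz. It does not call webrtcvad itself; it only widens a (start, end) pair that the VAD has already produced, and clamps it so it never goes outside the recording.

```python
def pad_segment(start_s, end_s, pad_s=0.02, audio_len_s=None):
    """Widen a VAD segment by pad_s seconds on each side.

    pad_s is the buffer time to tune; 0.02 (20 ms) is just the
    value from the example above, not a recommended setting.
    """
    # Never start before the beginning of the recording.
    padded_start = max(0.0, start_s - pad_s)
    padded_end = end_s + pad_s
    # Never run past the end of the recording, if its length is known.
    if audio_len_s is not None:
        padded_end = min(audio_len_s, padded_end)
    return padded_start, padded_end


def slice_pcm(pcm, start_s, end_s, sample_rate=16000, sample_width=2):
    """Cut [start_s, end_s) out of raw mono PCM bytes.

    Assumes 16-bit samples (sample_width=2 bytes) at 16 kHz.
    Offsets are rounded to whole samples so a sample is never split.
    """
    start_byte = int(round(start_s * sample_rate)) * sample_width
    end_byte = int(round(end_s * sample_rate)) * sample_width
    return pcm[start_byte:end_byte]


if __name__ == "__main__":
    audio = bytes(16000 * 2 * 10)  # 10 seconds of silence as dummy data
    start, end = pad_segment(4.20, 6.80, pad_s=0.02, audio_len_s=10.0)
    chunk = slice_pcm(audio, start, end)
    print(start, end, len(chunk))
```

Keeping the buffer as a single `pad_s` parameter makes it easy to try a few values and compare the transcripts.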

Am I correct in following this approach to deal with this error?
Thanks in advance.