I actually tried the workaround suggested here, and implemented a VAD-based solution that segments my long audio clip into shorter segments of one to ten seconds and feeds each chunk independently at inference time. However, the output still contains words run together incorrectly. I then tried fixed two-second chunks instead of VAD, and I still get words stuck together.
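For reference, here is a simplified sketch of the kind of pipeline I mean, using webrtcvad for segmentation and the DeepSpeech Python bindings for inference. The paths, VAD aggressiveness, and silence threshold are illustrative placeholders, not my exact values:

```python
# Sketch: split 16 kHz mono 16-bit PCM on VAD-detected silences,
# then transcribe each speech segment independently.
import wave

import numpy as np
import webrtcvad
from deepspeech import Model

SAMPLE_RATE = 16000          # DeepSpeech models expect 16 kHz mono 16-bit PCM
FRAME_MS = 30                # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per int16 sample

def read_pcm(path):
    """Read a 16 kHz mono 16-bit WAV file as raw PCM bytes."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == SAMPLE_RATE and w.getnchannels() == 1
        return w.readframes(w.getnframes())

def vad_segments(pcm, aggressiveness=2, max_silence_frames=10):
    """Split raw PCM into speech segments at pauses found by webrtcvad."""
    vad = webrtcvad.Vad(aggressiveness)
    segments, current, silence = [], bytearray(), 0
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            current.extend(frame)
            silence = 0
        elif current:
            current.extend(frame)  # keep short pauses inside the segment
            silence += 1
            if silence >= max_silence_frames:  # long enough pause: cut here
                segments.append(bytes(current))
                current, silence = bytearray(), 0
    if current:
        segments.append(bytes(current))
    return segments

model = Model("deepspeech-0.9.3-models.pbmm")             # placeholder path
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # placeholder path

pcm = read_pcm("long_recording.wav")                      # placeholder input
for seg in vad_segments(pcm):
    audio = np.frombuffer(seg, dtype=np.int16)
    print(model.stt(audio))   # transcribe each chunk independently
```

Even with each segment well under ten seconds (and with the arbitrary two-second variant), the per-chunk transcripts show the same run-together words.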
I believe this rules out the original suggestion that the audio length is the problem. Something more fundamental is going wrong in Mozilla’s DeepSpeech implementation. I’ll keep investigating, but do you have any other ideas?