Regarding longer audio files, you might find some hints in “Longer audio files with Deep Speech”. Long story short, the current design uses bidirectional recurrent layers, which means we need the complete audio before we can start decoding it.
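To illustrate why a bidirectional layer needs the whole recording, here is a toy sketch in plain Python. It is not the Deep Speech network: `forward_states`, `backward_states` and the 0.5 decay factor are made up for the example. The point is only that the backward pass starts from the last frame, so even the output for the first frame depends on the end of the audio.

```python
def forward_states(frames):
    """Toy recurrent pass: each state depends on the current and earlier frames."""
    state = 0.0
    states = []
    for x in frames:
        state = 0.5 * state + x  # hypothetical recurrence, for illustration only
        states.append(state)
    return states


def backward_states(frames):
    """Same recurrence run from the last frame to the first."""
    return forward_states(frames[::-1])[::-1]


def bidirectional(frames):
    """Pair the forward and backward state for every time step."""
    return list(zip(forward_states(frames), backward_states(frames)))


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [1.0, 2.0, 4.0]  # only the LAST frame differs
    # The forward state at t=0 is identical, the backward state at t=0 is not.
    print(bidirectional(a)[0])
    print(bidirectional(b)[0])
```

Changing only the last frame changes the backward state at the first time step, which is why the output cannot be produced frame by frame as the audio arrives.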
Regarding cancelling on long audio, that feels like a good idea, but it raises more questions: where do we draw the line? Moreover, the limit does not depend only on the audio length; it also depends on your hardware, and it can vary a lot from one machine to another.
In other words, given that this is alpha software, I think we should not spend our time on this kind of workaround and should instead:
- optimize the network so that it requires fewer resources
- make the system streamable
That being said, if you want to submit a workaround that implements this kind of limit, we’ll be happy to help and review your patches.
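For anyone who wants to try, a minimal sketch of such a guard could look like the following. It only uses Python's standard `wave` module and is not part of Deep Speech. The function names and the `max_seconds` default of 30 seconds are assumptions for the example; as said above, the right limit depends on your hardware, so it is left as a parameter.

```python
import wave


def audio_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_audio_length(path, max_seconds=30.0):
    """Raise ValueError if the file is longer than max_seconds.

    max_seconds is a hypothetical, hardware-dependent limit chosen by the
    caller; 30.0 is only a placeholder default.
    """
    duration = audio_duration_seconds(path)
    if duration > max_seconds:
        raise ValueError(
            f"audio is {duration:.1f}s long, limit is {max_seconds:.1f}s"
        )
    return duration
```

You would call `check_audio_length(path)` before handing the audio to the model, and catch the `ValueError` to report the problem instead of running out of memory.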