Why 5s audio?


I would like to know why the instructions state that recognition works only on 5s long audio files. Will it not work on ones >10s, or does the quality just drop?


This was an old constraint, and I can’t find any references to it. Can you please link to where you read it?

Oh, that’s been lifted now? I was still splitting everything up into small segments.
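
For anyone who still wants to split long recordings into short segments, here is a minimal sketch using only Python's standard `wave` module. The function name `split_wav`, the output file naming, and the 5-second default are my own assumptions for illustration, not part of DeepSpeech. It cuts at fixed intervals, so a cut may land in the middle of a word.

```python
import os
import wave


def split_wav(src_path, out_dir, chunk_seconds=5.0):
    """Split a PCM WAV file into consecutive chunks of at most chunk_seconds.

    Hypothetical helper: returns the list of chunk file paths it wrote.
    """
    os.makedirs(out_dir, exist_ok=True)
    out_paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Number of audio frames that make up one chunk.
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of file reached
            out_path = os.path.join(out_dir, f"chunk_{index:04d}.wav")
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Calling `split_wav("long.wav", "chunks")` would write `chunks/chunk_0000.wav`, `chunks/chunk_0001.wav`, and so on, with the last chunk holding whatever audio is left over.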

Is the length limit still there, only higher, or can we assume it to be unlimited?

Well, we had that "limit" back then because we knew the model did not perform well on long audio, due to the bidirectional layer. Now that we have proper streaming, it should be better.

If you are referring to the training part, there’s still some kind of limit, because audio that is too long will be hard to fit into GPU memory.
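
As a rough illustration of the training-side limit, one option is to drop clips above some duration before building a training set. This is only a sketch: `wav_duration_seconds`, `filter_by_duration`, and the 10-second default are hypothetical, and the right cutoff depends on your GPU memory and batch size.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def filter_by_duration(paths, max_seconds=10.0):
    """Keep only the files no longer than max_seconds (assumed cutoff)."""
    return [p for p in paths if wav_duration_seconds(p) <= max_seconds]
```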

I read it on GitHub here
I thought that was still the case since it’s mentioned on the project homepage:

Once everything is installed, you can then use the deepspeech binary to do speech-to-text on short (approximately 5-second long) audio files

Anyway, thanks for the response.