Using the python package on longer audio

saseptim · March 1, 2018, 11:04am

Hi

The documentation states that the Python package can be used to run the pretrained model on clips of around 5 seconds. Why is there such a limitation? Librivox samples can run up to 20 seconds long.

kdavis · March 2, 2018, 3:12pm

The neural net architecture used and the GPU memory are the limiting factors and are independent of the Librivox sample size.
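A rough back-of-the-envelope sketch shows why memory grows with clip length. The recurrent layer has to hold one output per time step for the whole clip, so its activation memory scales linearly with duration. The 10 ms feature step and the 2048-unit hidden size below are assumptions for illustration, not values confirmed in this thread, and `time_steps` / `brnn_activation_bytes` are hypothetical helpers.

```python
def time_steps(seconds: float, step_ms: float = 10.0) -> int:
    """Number of feature frames for a clip, assuming one frame every `step_ms` ms (assumed value)."""
    return int(round(seconds * 1000.0 / step_ms))


def brnn_activation_bytes(
    seconds: float,
    hidden_units: int = 2048,   # assumed layer width, for illustration only
    step_ms: float = 10.0,      # assumed feature step
    bytes_per_value: int = 4,   # float32
) -> int:
    """Rough size of one bidirectional layer's outputs: frames x units x 2 directions."""
    return time_steps(seconds, step_ms) * hidden_units * 2 * bytes_per_value


if __name__ == "__main__":
    for secs in (5, 20, 35):
        mb = brnn_activation_bytes(secs) / 1e6
        print(f"{secs:>2} s -> {time_steps(secs)} steps, ~{mb:.1f} MB for one BRNN layer")
```

The exact numbers matter less than the linear scaling. Under these assumptions a 20 s clip needs about four times the activation memory of a 5 s clip for the same layer.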

saseptim · March 4, 2018, 10:50am

So basically, if I have a GPU or CPU with lots of memory, then I can run on more than 5 seconds?

kdavis · March 4, 2018, 10:58am

No. As mentioned, the neural net architecture used is also a limiting factor. BRNNs (bidirectional recurrent neural networks) cannot handle arbitrarily long sequences well.

You can experiment with longer audio clips but the recognition quality will likely suffer.
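If you need to process longer recordings with a model that works best on short clips, one simple approach is to cut the audio into pieces of roughly 5 seconds and transcribe each piece separately. This approach is not suggested anywhere in this thread. The sketch below only does the splitting. `split_into_chunks` is a hypothetical helper, and how each chunk is fed to the model depends on the package version, so that part is left out.

```python
from typing import List, Sequence


def split_into_chunks(
    samples: Sequence[int], sample_rate: int, max_seconds: float = 5.0
) -> List[Sequence[int]]:
    """Split raw audio samples into consecutive chunks of at most `max_seconds` each.

    Fixed-length cuts can land in the middle of a word; this is a naive
    illustration, not a recommended segmentation strategy.
    """
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    chunk_len = max(1, int(sample_rate * max_seconds))
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


if __name__ == "__main__":
    rate = 16000
    audio = [0] * (12 * rate)  # 12 s of silence as stand-in audio
    chunks = split_into_chunks(audio, rate)
    print([len(c) / rate for c in chunks])  # [5.0, 5.0, 2.0]
```

Splitting at fixed offsets can cut words in half at chunk boundaries. Splitting on pauses (silence) generally avoids this, at the cost of more code.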

saseptim · March 4, 2018, 11:04am

I see, but to achieve such good results on Librivox, you must be getting good results on its samples, which are up to 35 seconds long. How does that work?
Also, I noticed that zero-padding the input audio signal hurts the results. Is this characteristic of the BRNN?
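Zero-padding here means appending silent (zero-valued) samples so that every clip reaches a fixed length. The minimal sketch below makes that concrete. `zero_pad` is a hypothetical helper, and the sketch only shows the operation; it says nothing about why padding affects accuracy.

```python
from typing import List, Sequence


def zero_pad(samples: Sequence[int], sample_rate: int, target_seconds: float) -> List[int]:
    """Append zero-valued samples until the clip lasts `target_seconds`.

    Clips that are already at least that long are returned unchanged (as a list).
    """
    target_len = int(sample_rate * target_seconds)
    padding = max(0, target_len - len(samples))
    return list(samples) + [0] * padding


if __name__ == "__main__":
    clip = [3, -2, 5]                 # a tiny stand-in "clip" at 2 samples per second
    print(zero_pad(clip, 2, 3.0))     # [3, -2, 5, 0, 0, 0]
```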

kdavis · March 4, 2018, 12:06pm

We get good results on the Librivox clean test set.

Some of the Librivox clean test samples are longer than 5 seconds; some are shorter. 5 seconds is not a hard and fast rule. However, longer clips will run into the problem I mentioned: BRNNs cannot handle arbitrarily long sequences well.
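Since 5 seconds is a rough guideline rather than a hard limit, it can be handy to check how long a clip is before running it. The minimal sketch below uses Python's standard `wave` module and assumes uncompressed WAV input. `GUIDELINE_SECONDS`, `wav_duration` and `longer_than_guideline` are hypothetical names.

```python
import io
import wave

GUIDELINE_SECONDS = 5.0  # rough guideline from this thread, not a hard limit


def wav_duration(source) -> float:
    """Duration in seconds of a WAV file given as a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def longer_than_guideline(source, limit: float = GUIDELINE_SECONDS) -> bool:
    """True if the clip is longer than `limit` seconds (quality may degrade; it need not fail)."""
    return wav_duration(source) > limit


if __name__ == "__main__":
    # Build 7 s of 16 kHz, 16-bit mono silence in memory as a stand-in recording.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000 * 7)
    buf.seek(0)
    print(wav_duration(buf))           # 7.0
    buf.seek(0)
    print(longer_than_guideline(buf))  # True
```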
