Using the Python package on longer audio

(Avi) #1


The documentation states that the Python package can be used to run the pretrained model on clips of around 5 seconds. Why is there such a limitation? Librivox samples can run up to 20 seconds long.

(kdavis) #2

The neural net architecture used and the GPU memory are the limiting factors and are independent of the Librivox sample size.

(Avi) #3

So basically, if I have a GPU or CPU with lots of memory, I can run it on clips longer than 5 seconds?

(kdavis) #4

No. As mentioned, the neural net architecture used is also a limiting factor. BRNNs cannot handle arbitrarily long sequences well.

You can experiment with longer audio clips, but the recognition quality will likely suffer.

(Avi) #5

I see. But to achieve such good results on Librivox, you must be getting good results on its samples, which are up to 35 seconds long. How does that work?
Also, I noticed that zero-padding the input audio signal hurts the results. Is this characteristic of the BRNN?

(kdavis) #6

We get good results on the Librivox clean test set.

Some of the Librivox clean test samples are longer than 5 seconds; some are shorter. 5 seconds is not a hard and fast rule. However, longer clips will run into the problem I mentioned: BRNNs cannot handle arbitrarily long sequences well.
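
For anyone who wants to experiment with longer recordings, one possible workaround is to cut the audio into pieces of roughly 5 seconds and run each piece through the model separately. The snippet below is only a minimal sketch of the splitting step, not something taken from the package itself: `split_into_chunks` is a hypothetical helper, and the 16 kHz sample rate and 5 second limit are assumptions you should adjust to match your own audio and model.

```python
def split_into_chunks(samples, sample_rate=16000, max_seconds=5.0):
    """Split a sequence of audio samples into consecutive chunks.

    Hypothetical helper, not part of any speech recognition package.
    Assumes `samples` is a sliceable sequence (list, array, bytes of
    known sample width handled by the caller, etc.) of mono samples.
    """
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")

    # Number of samples that fit in one chunk of max_seconds.
    chunk_len = int(sample_rate * max_seconds)

    # Consecutive, non-overlapping slices; the last one may be shorter.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# Example: 12 seconds of silence at the assumed 16 kHz sample rate.
audio = [0] * (12 * 16000)
chunks = split_into_chunks(audio)
print([len(c) / 16000 for c in chunks])
```

Note that cutting at fixed boundaries can split a word in half, which will probably hurt recognition at the chunk edges. Cutting at pauses in the speech would likely work better, but that is beyond this sketch.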