Can DeepSpeech process longer audio files?


#1

The other day I was testing DeepSpeech and by mistake specified a long audio file, far longer than 10 seconds: it was one of my 45-to-60-minute recordings.

The computer froze totally, no response, hard drive light on full. The only remedy was to power down. Now that computer makes a ‘clunk’ noise when I power up. Past experience has shown me this is an early warning of HDD failure.

Maybe the developers can check an audio duration when DeepSpeech starts up and if more than (say) 60 seconds, give the option of cancelling/aborting.
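For anyone wanting to guard against this themselves in the meantime, here is a rough sketch of such a check using only the Python standard library. The 60-second threshold and the confirmation prompt are purely illustrative, not anything DeepSpeech provides:

```python
import sys
import wave

MAX_SECONDS = 60  # arbitrary threshold, as suggested above


def check_duration(path, max_seconds=MAX_SECONDS):
    """Return the WAV duration in seconds, prompting before long files."""
    with wave.open(path, "rb") as wav:
        seconds = wav.getnframes() / wav.getframerate()
    if seconds > max_seconds:
        answer = input(f"{path} is {seconds:.0f}s long; continue? [y/N] ")
        if answer.strip().lower() != "y":
            sys.exit("Aborted: audio too long.")
    return seconds
```

Calling this before handing the file to DeepSpeech at least gives you a chance to bail out before the machine starts swapping.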

The computer I was using has 4 × Intel Core i5-2430M CPU @ 2.40GHz processors and 3.8 GiB of RAM, so not what I would call ‘underpowered’. It was a gift from my son… hmm, not impressed. :frowning:

I appreciate DS is still alpha, but one would hope that common sense would prevail.


(Lissyx) #2

Regarding longer audio files, you might find hints in Longer audio files with Deep Speech. Long story short, the current design with bidirectional recurrent layers requires us to have full knowledge of the audio we want to decode before inference can start.

Regarding cancelling long audio, that feels like a good idea, but it raises more questions: where do we draw the line? Moreover, it’s not just the audio length itself; it also depends on your hardware, and that can vary a great deal.

Put another way, given it’s alpha software, I think we should not spend our time on this kind of workaround and should instead:

  • optimize the network to require fewer resources
  • make the system streamable

That being said, if you want to submit a workaround implementing this kind of limit, we’ll be happy to help and review your patches :slight_smile:


#3

> Regarding longer audio files, you might have hints in Longer audio files with Deep Speech. Long story short, the actual design with bidirectional recurrent layers requires us to have full knowledge of the audio we want to decode.

Thanks, yes, I did read through that post before starting this thread, but it was more about how to train, not an answer as such to whether DeepSpeech can process (produce a transcription of) longer audio files. The audios we have run between 44 minutes and 1 hr 18 minutes. If I applied the solution in that thread of cutting up just one audio, I would need 936 wav files for DeepSpeech to do the training.
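For what it’s worth, chopping a file into fixed-length pieces can be done with the standard library alone; at 5-second chunks, a 78-minute recording does indeed come out to 936 files. A sketch (this splits at fixed offsets, which would slice words in half; a real pipeline would cut at silences, e.g. with a voice activity detector, instead):

```python
import wave


def split_wav(path, chunk_seconds=5, prefix="chunk"):
    """Split a WAV file into fixed-length chunks; the last may be shorter.

    Cutting at fixed offsets is only a sketch: a real pipeline would cut
    at silences so words are not sliced in half.
    """
    names = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{prefix}_{index:04d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            names.append(name)
            index += 1
    return names
```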

Even if I did that for one audio (and there are hundreds), what would the expected output be? What accuracy of transcript? I calculated the WER of a 19-second audio transcript (output from DeepSpeech) and the error rate was about 46%.
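For reference, that WER figure is just a word-level Levenshtein distance divided by the reference length; a minimal sketch of the computation:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / len(ref)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (or match)
            prev = cur
    return d[-1] / len(ref)
```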

Sure, if I spent the time building a specific model just for this speaker, that makes sense. But will I be able to run DeepSpeech on that computer after the training is done? Or will it consume so many resources that the computer freezes? More hard drive damage?

> Regarding the cancelling for long audio, that feels like a good idea but then it means more questions: where do we draw the line? And moreover, it’s not just based on the audio length itself, it also depends on your hardware, and it might be very very different.

Yes, good point. How about at least enabling Ctrl-C?

I’m wondering if I should just learn to touch type to produce the transcriptions. :wink:


(Lissyx) #4

Well, the thread I pointed you at contains an answer from Kelly, who explicitly documents that, because of the architecture of the network and the current training dataset, processing very long audio will likely not work as expected :).

Regarding the accuracy, there are a number of other factors. You say 46% on 19 seconds of audio; that’s not what we’d expect, but it can depend on a lot of things: the dataset we have makes the model behave erratically if you don’t have clean American English audio. It could also be microphone interference…

Besides, sorry, but DeepSpeech does not “damage” your computer. It’s computationally intensive, but we know that, and again, we are working on it. All of that takes time to accomplish properly.

If you train for a specific speaker, I would suggest taking a look at TUTORIAL : How I trained a specific french model to control my robot, where Vincent produced a model dedicated to himself, smaller and running well on an NVIDIA GPU for his robot. It’s not magic: he was able to produce enough audio data to train seriously, but he also reduced the model size, making it much smaller and thus much less computationally intensive. There’s a balance between the generalization capabilities of a model and its complexity.

For CTRL+C, I guess you are referring to the Python bindings? It’s likely the usual mess of Python and threads; I’m not even sure we can do anything about that.
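One workaround that sidesteps the Python-threads issue entirely is to run inference in a child process, which can be killed after a timeout (or interrupted with Ctrl-C while the parent is merely waiting) even when native code is busy. A generic sketch; the actual command you pass in would be whatever your `deepspeech` invocation looks like:

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run a command, killing it if it exceeds the timeout.

    Because the child is a separate process, the timeout takes effect
    even while native inference code is busy; the parent is only waiting.
    Returns the CompletedProcess, or None if the command was killed.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_seconds, check=True)
    except subprocess.TimeoutExpired:
        print(f"Killed after {timeout_seconds}s: {' '.join(cmd)}")
        return None
```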