Longer audio files with Deep Speech

(Ian) #1

Hi, I was testing out Deep Speech against some rather long audio files, around 7000 seconds each. The output is only ever a single line; in this case it was “bononsgtleafoaerarrbergthomrmooaheearanheersbroreretrwnobreorrthoofouddooopokraimaoraetoharetulriterarteorpooooppisti”. I was using the output and language models supplied with the release for this test. I got similar results with files in the multiple-minute range. The sample files provided with the release all worked great.

If I want to get more sensible results out of Deep Speech, should I be reducing the length of the audio files? Is there a recommended maximum length for audio? Or would I be seeing poor results for a different reason?

(kdavis) #2

Deep Speech was trained on audio files that are “sentence length”, about 4-5 seconds. So it deals best with “sentence length” chunks of audio.

You can use voice activity detection to cut your audio into “sentence length” chunks. For example, in Python you could use webrtcvad; I haven’t tried it myself. In Node.js you could use voice-activity-detection; again, I haven’t tried it myself. In C++ you could use the raw WebRTC VAD.
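A minimal sketch of the Python route, assuming 16 kHz, 16-bit mono PCM input. The helper names (`frames`, `split_on_silence`, `webrtc_is_speech`) and the `max_silence_frames` threshold are made up for illustration; only `webrtcvad.Vad` and `Vad.is_speech` come from py-webrtcvad. The speech detector is passed in as a callable, so the grouping logic can be tried without the library installed.

```python
from typing import Callable, Iterator, List

SAMPLE_RATE = 16000       # assumption: 16 kHz input
FRAME_MS = 30             # webrtcvad accepts 10, 20 or 30 ms frames
BYTES_PER_SAMPLE = 2      # 16-bit PCM


def frames(pcm: bytes, sample_rate: int = SAMPLE_RATE,
           frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Yield consecutive fixed-size frames; a trailing partial frame is dropped."""
    size = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]


def split_on_silence(pcm: bytes, is_speech: Callable[[bytes], bool],
                     max_silence_frames: int = 10) -> List[bytes]:
    """Group frames into segments, closing a segment once
    max_silence_frames consecutive non-speech frames have been seen."""
    segments: List[bytes] = []
    current: List[bytes] = []
    silence = 0
    for frame in frames(pcm):
        if is_speech(frame):
            current.append(frame)
            silence = 0
        elif current:
            current.append(frame)
            silence += 1
            if silence >= max_silence_frames:
                # Keep the speech, drop the trailing silence.
                segments.append(b"".join(current[:len(current) - silence]))
                current, silence = [], 0
    if current:
        segments.append(b"".join(current[:len(current) - silence]))
    return segments


def webrtc_is_speech(aggressiveness: int = 0,
                     sample_rate: int = SAMPLE_RATE) -> Callable[[bytes], bool]:
    """Build a per-frame detector backed by py-webrtcvad (pip install webrtcvad)."""
    import webrtcvad  # imported lazily so the rest runs without it
    vad = webrtcvad.Vad(aggressiveness)
    return lambda frame: vad.is_speech(frame, sample_rate)
```

With the library installed, something like `split_on_silence(pcm, webrtc_is_speech(0))` would return a list of byte strings, each of which could be written out as its own WAV and fed to the model separately.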

(Bradneuberg) #3

I actually tried the workaround suggested here, and implemented a VAD based solution to segment my long audio clip into shorter segments of one to ten seconds, feeding each chunk independently at inference time. However, I still get long words chunked together incorrectly. I then arbitrarily made my audio chunks two seconds rather than using VAD and still get words stuck together.
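For anyone who wants to reproduce the fixed-length variant, a rough sketch follows. The function name `fixed_chunks` and the 16 kHz, 16-bit mono assumption are illustrative, not taken from the DeepSpeech code. Note that cutting at fixed offsets can split a word in the middle, which VAD-based cutting tries to avoid.

```python
from typing import List


def fixed_chunks(pcm: bytes, seconds: float = 2.0,
                 sample_rate: int = 16000,
                 bytes_per_sample: int = 2) -> List[bytes]:
    """Cut raw PCM into equal-length chunks; the last chunk may be shorter."""
    size = int(seconds * sample_rate) * bytes_per_sample
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


# Five seconds of silence cut into two-second pieces.
chunks = fixed_chunks(bytes(16000 * 2 * 5))
print([len(c) for c in chunks])  # [64000, 64000, 32000]
```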

I believe this means that the original suggestion that the problem is long audio is incorrect. Something more fundamental is going wrong in Mozilla’s DeepSpeech implementation. I’ll continue investigating, but do you have any other ideas?

(Yv) #4

Yeah, I had a similar experience. In fact, results of the acoustic model alone were better in these cases so I suspect it’s the language model application that is the culprit.

(Rajateku6) #5

@bradneuberg @yv001 I have the same issue. Have you guys figured out why this is happening? One idea is to use the acoustic model and build our own language model. I don’t know how feasible that is, but it would help if the Mozilla people left some kind of provision for developers to implement an LM for specific use cases. In my case I don’t need a very generic LM, just a very limited command-based language model.

(Yv) #6

I believe this is being dealt with in https://github.com/mozilla/DeepSpeech/issues/1156 on Mozilla’s side.
Though I am not sure what the progress is on the issue.

If you want to do your own implementation of LM scoring, you’d need to make some substantial changes in the native_client deepspeech code (e.g. beam_search.h, beam_search.cc, deepspeech.cc, deepspeech.h etc.) and rebuild it yourself.


I tested the Python one you mentioned ( https://github.com/wiseman/py-webrtcvad ) today on a small WAV of 55 words, 275 characters, and a length of 19.968 seconds.

It split the WAV into 5 WAV files:

00:00:01.59 - 3 words
00:00:06.87 - 21 words
00:00:03.63 - 12 words
00:00:03.33 - 10 words
00:00:02.43 - 9 words

Playing all the files in the same sequence as the original WAV, there appears to be no word truncation. I don’t understand the aggressiveness setting, and simply ran it with ‘0’.

As I was matching up the (real) transcript with the audio from this Python tool, I also found that where sentences ended, the audio also ended. Possibly it is picking up the additional pause at the end of a sentence?

The total length of the 5 WAV files is 17.85 seconds, where the original was 19.968 seconds. What was ‘dropped’ was obviously noise and not speech.
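The figures above can be checked with a few lines of Python; the numbers are the ones listed in this post, and the variable names are just for illustration.

```python
# Segment durations (seconds) and word counts reported for the 5 output WAVs.
durations = [1.59, 6.87, 3.63, 3.33, 2.43]
words = [3, 21, 12, 10, 9]
original = 19.968  # length of the source WAV in seconds

total = round(sum(durations), 2)
dropped = round(original - total, 3)

print(total)       # 17.85
print(dropped)     # 2.118 seconds removed as non-speech
print(sum(words))  # 55, matching the word count of the original
```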


The results of testing the VAD tool are at TUTORIAL : How I trained a specific french model to control my robot