Longer audio files with Deep Speech

Hi, I was testing out Deep Speech against some rather long audio files, around 7000 seconds. The output is only ever a single line; in this case it was “bononsgtleafoaerarrbergthomrmooaheearanheersbroreretrwnobreorrthoofouddooopokraimaoraetoharetulriterarteorpooooppisti”. I was using the output and language models supplied with the release for this test. I got similar results with files in the multiple-minute range. The sample files provided with the release all worked great.

If I want to get more sensible results out of Deep Speech, should I be reducing the length of the audio files? Is there a recommended maximum length for audio, or would I be seeing poor results for a different reason?


Deep Speech was trained on audio files that are “sentence length”, about 4-5 seconds. So it deals best with “sentence length” chunks of audio.

You can use voice activity detection (VAD) to cut your audio into “sentence length” chunks. For example, in Python you could use webrtcvad, in Node.js voice-activity-detection, and in C++ the raw WebRTC VAD; I haven’t tried any of them myself.
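For reference, a minimal sketch of the Python route with webrtcvad (assuming a 16 kHz, 16-bit mono WAV; the py-webrtcvad repo ships a more robust example with padding/hysteresis, this one just groups consecutive speech frames):

```python
# Split a 16 kHz, 16-bit mono WAV into speech chunks with webrtcvad.
import wave
import webrtcvad

def speech_chunks(path, aggressiveness=0, frame_ms=30):
    vad = webrtcvad.Vad(aggressiveness)   # 0 = least aggressive, 3 = most
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        rate = wf.getframerate()          # webrtcvad supports 8/16/32/48 kHz
        frame_bytes = int(rate * frame_ms / 1000) * 2
        pcm = wf.readframes(wf.getnframes())

    chunk = bytearray()
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[i:i + frame_bytes]
        if vad.is_speech(frame, rate):    # accumulate frames flagged as speech
            chunk.extend(frame)
        elif chunk:
            yield bytes(chunk)            # a silent frame ends the current chunk
            chunk = bytearray()
    if chunk:
        yield bytes(chunk)

# Each yielded chunk can then be written to its own WAV and fed to DeepSpeech.
```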


I actually tried the workaround suggested here and implemented a VAD-based solution to segment my long audio clip into shorter segments of one to ten seconds, feeding each chunk independently at inference time. However, I still get long words run together incorrectly. I then arbitrarily made my audio chunks two seconds rather than using VAD and still get words stuck together.
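For concreteness, the fixed two-second chunking looks roughly like this sketch (assuming the deepspeech Python package with the 0.7+-style API, i.e. Model(path) and model.stt() on an int16 array; the model and file names below are placeholders):

```python
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-models.pbmm")              # placeholder model path

with wave.open("long_recording.wav", "rb") as wf:    # assumed 16 kHz, 16-bit mono
    rate = wf.getframerate()
    audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

chunk_len = 2 * rate                                 # two seconds of samples
for start in range(0, len(audio), chunk_len):
    chunk = audio[start:start + chunk_len]
    print(model.stt(chunk))                          # transcribe each chunk independently
```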

I believe this means that the original suggestion that the problem is long audio is incorrect. Something more fundamental is going wrong in Mozilla’s DeepSpeech implementation. I’ll continue investigating, but do you have any other ideas?


Yeah, I had a similar experience. In fact, the results of the acoustic model alone were better in these cases, so I suspect the language model application is the culprit.


@bradneuberg @yv001 I have the same issue. Have you figured out why this is happening? One idea is to use the acoustic model and build our own language model. I don’t know how feasible that is, but do the Mozilla people have some kind of provision for custom LM implementations for specific use cases, left to developers? In my case I don’t require a very generic LM, just a very limited command-based language model.

I believe this is being dealt with in https://github.com/mozilla/DeepSpeech/issues/1156 on Mozilla’s side, though I am not sure what the progress on that issue is.

If you want to do your own implementation of LM scoring, you’d need to make some substantial changes to the native_client DeepSpeech code (e.g. beam_search.h, beam_search.cc, deepspeech.cc, deepspeech.h, etc.) and rebuild it yourself.


I tested the Python one you mentioned ( https://github.com/wiseman/py-webrtcvad ) today, on a small WAV of 55 words, 275 characters and a length of 19.968 seconds.

It split the WAV into 5 WAV files:

00:00:01.59 - 3 words
00:00:06.87 - 21 words
00:00:03.63 - 12 words
00:00:03.33 - 10 words
00:00:02.43 - 9 words

Playing all the files in the same sequence as the original WAV, there appears to be no word truncation. I don’t understand the aggressiveness setting, so I simply ran it with a value of 0.

As I was matching up the (real) transcript with the audio from this Python tool, I also found that where sentences ended, the audio also ended. Possibly it is picking up the additional pause at the end of a sentence?

The total length of the 5 WAV files is 17.85 seconds, where the original was 19.968 seconds. What was ‘dropped’ was obviously noise and not speech.

Hope this helps.

The results of testing the VAD tool are at TUTORIAL : How I trained a specific french model to control my robot

Well, this is a problem for us, a big problem. We are working with Oral History interviews that are an hour and a half long and are associated with video files. We need a perfect correlation between the time in the transcription text and the time in the video files. Why? Because the transcription is used to search for and find video fragments, to build the subtitles, to do indexation, etc.
If we don’t have a perfect time correlation between video and transcription, all searches inside OH catalogues fail (you can find the text, but you will not hear the video at the correct time, and the result is a failed search…)

We are trying to implement DeepSpeech in our project for managing Cultural Heritage and Oral History, Dédalo, but we are finding a lot of problems implementing it, only because DeepSpeech can’t process long files, and audio segmentation IS NOT THE SOLUTION.

We know that “short commands” are a big area (maybe where the market/money is), but we think that supporting long audio processing would open up other uses like Oral History or Dictation…

Because DeepSpeech can’t process long files, and audio segmentation IS NOT THE SOLUTION.

Wouldn’t audio segmentation work here, if you keep track of the timestamps of each segment?
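If you go that route, the bookkeeping is small. A hypothetical sketch (all names here are illustrative, not part of any DeepSpeech API):

```python
# Keep the original-timeline offset of every segment so each transcript
# can be mapped back to the full recording.

def transcribe_with_offsets(segments, sample_rate, stt):
    """segments: list of (start_sample, int16_audio) pairs from the segmentation step.
    stt: any function turning int16 audio into text, e.g. a DeepSpeech model's stt().
    Returns (start_seconds, end_seconds, text) triples on the original timeline."""
    results = []
    for start_sample, audio in segments:
        start_s = start_sample / sample_rate
        end_s = (start_sample + len(audio)) / sample_rate
        results.append((start_s, end_s, stt(audio)))
    return results
```

Each resulting triple can become one subtitle cue or search index entry, so hits land at the right time in the video even though inference ran on short segments.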

A lot has happened since this thread started. We now have streaming as well as an API that allows getting timestamps. Addressing this kind of use case should be much, much simpler (we have contributors sharing feedback that it is working quite well). If you have more feedback to share, you are welcome.
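For example, a minimal sketch of the streaming + timestamp route, assuming a 0.7+-era deepspeech Python package (method and attribute names may differ between releases; model and file names are placeholders):

```python
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-models.pbmm")            # placeholder model path
stream = model.createStream()

with wave.open("interview.wav", "rb") as wf:       # assumed 16 kHz, 16-bit mono
    while True:
        data = wf.readframes(wf.getframerate())    # feed roughly one second at a time
        if not data:
            break
        stream.feedAudioContent(np.frombuffer(data, dtype=np.int16))

metadata = stream.finishStreamWithMetadata()
for token in metadata.transcripts[0].tokens:
    # each token carries a start_time (in seconds) on the original timeline
    print(token.text, token.start_time)
```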

Could you give a bit more detail? Is segmentation of longer audio files not needed anymore?

This is a years-old topic, and your question is a bit vague. Can you be more specific?