Text produced has long strings of words with no spaces


(Bradneuberg) #1

Thanks for the Mozilla DeepSpeech project! Great open source contribution.

I’m getting long strings of words with no spaces. Example:

split the cape handler out from the sir hanler and make or on new hanlerswas not moneliticamandthenduconthingswerencaerulizationotenopuing paws on that until i its a product signalmyeanwhatwassomebiokarsthatyouwerefocusedtomtheturmeting projection so at

My audio is a mono WAV file at 16 kHz with a bit depth of 16, encoded as PCM S16 LE. I’m using the default Python client to test things. The audio was recorded cleanly via a Mac OS X laptop’s microphone.
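For reference, here’s roughly how I sanity-checked the container before feeding it to the client (plain standard-library Python; the file name is just a placeholder):

```python
# Quick check that the recording really is 16 kHz, 16-bit, mono PCM
# before handing it to the DeepSpeech client ("recording.wav" is a
# placeholder path).
import wave

w = wave.open("recording.wav", "rb")
print("channels:     ", w.getnchannels())    # expect 1 (mono)
print("sample rate:  ", w.getframerate())    # expect 16000
print("sample width: ", w.getsampwidth())    # expect 2 bytes (16-bit)
print("duration (s): ", w.getnframes() / float(w.getframerate()))
w.close()
```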

I’ve seen this with other audio samples I’ve tried. I looked through the Mozilla DeepSpeech GitHub issues and didn’t see others reporting this. Is this a known issue? Are there any known workarounds (a different audio setup, etc.)?

Thanks!

Best,
Brad Neuberg


(Lissyx) #2

How long is your audio file?


(Bradneuberg) #3

~15 minutes

Looks like this is a known issue reported by others in the message base: Longer audio files with Deep Speech

Maybe this should be added somewhere in the Mozilla DeepSpeech FAQ? It might also be nice to give the Python client a command-line option that uses VAD to segment the audio, to prevent this issue; a rough sketch of the idea is below.
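Just to make the suggestion concrete, here is what such an option could do internally, using the third-party webrtcvad package. This is a sketch only; the frame size and aggressiveness values are my own assumptions, not anything the client ships today.

```python
# Sketch only: split 16 kHz, 16-bit mono PCM into speech segments with
# webrtcvad (pip install webrtcvad), so each segment can be fed to the
# model separately. A real implementation would pad segment boundaries.
import wave
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad accepts 10/20/30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples -> 2 bytes each

def speech_segments(path, aggressiveness=2):
    vad = webrtcvad.Vad(aggressiveness)
    w = wave.open(path, "rb")
    pcm = w.readframes(w.getnframes())
    w.close()

    segments, current = [], bytearray()
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = bytes(pcm[off:off + FRAME_BYTES])
        if vad.is_speech(frame, SAMPLE_RATE):
            current.extend(frame)
        elif current:                              # silence closes a segment
            segments.append(bytes(current))
            current = bytearray()
    if current:
        segments.append(bytes(current))
    return segments
```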


(Lissyx) #4

I’m not too sure about the VAD stuff: it’s quite intrusive, and will probably add more complexity when we don’t need any. If you think it’s useful to document that in the FAQ, why not, do that :).

Basically the deal is that what we train on is a few minutes at most, so the model gets tripped up by much longer sequences. This might change in the future, however.


(Bradneuberg) #5

Actually, I implemented both VAD and fixed-length chunking and the problem remains, so I think something else is broken. I documented more details here: Longer audio files with Deep Speech
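For completeness, the fixed-length chunking I tried was essentially this (a minimal standard-library sketch; the chunk length is arbitrary):

```python
# Minimal fixed-length chunker: copy N-second slices of a mono WAV
# into chunk_000.wav, chunk_001.wav, ...
import wave

def split_wav(path, chunk_seconds=4):
    src = wave.open(path, "rb")
    frames_per_chunk = src.getframerate() * chunk_seconds
    index = 0
    while True:
        frames = src.readframes(frames_per_chunk)
        if not frames:
            break
        dst = wave.open("chunk_%03d.wav" % index, "wb")
        dst.setnchannels(src.getnchannels())
        dst.setsampwidth(src.getsampwidth())
        dst.setframerate(src.getframerate())
        dst.writeframes(frames)
        dst.close()
        index += 1
    src.close()
```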


(Lissyx) #6

It might also just be the result of training vs. real-world usage. We know that non-native American English speakers (myself included) get worse results because of the training dataset; hopefully it will improve once training includes a broader range of accents. If you can, record clear audio clips of 5–10 seconds (sometimes microphones produce strange stuff too) and make sure you try with and without the language model :slight_smile:


(Bradneuberg) #7

I don’t believe it’s the accents or the acoustic model: if I run without the language model, no words are run together, so the problem looks like it’s in the language model somehow.


(Bradneuberg) #8

I’d like to add some custom words to the language model to see if that helps the garbled words issue, but I can’t regenerate it since the real vocab.txt is not available. Is there any way I can privately get it to aid debugging and testing of this issue?
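If the text were ever available, the rebuild itself would presumably be the standard KenLM pipeline plus DeepSpeech’s trie generation; something like the sketch below. The tool invocations, the 5-gram order, and the file names are all assumptions on my part, not the official recipe.

```python
# Hypothetical LM rebuild with KenLM's lmplz / build_binary. A matching
# trie would still need to be generated with the native client's
# generate_trie tool; that step is omitted here.
import subprocess

CORPUS = "vocab_plus_custom_words.txt"   # corpus text, one sentence per line

subprocess.check_call("lmplz --order 5 < %s > lm.arpa" % CORPUS, shell=True)
subprocess.check_call(["build_binary", "lm.arpa", "lm.binary"])
```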


(Lissyx) #9

Can you share the output with and without the language model, please? And with short and long audio clips?


(Kdavis) #10

The text used to train the language model was/is a combination of texts from the Fisher, Switchboard, and other corpora. As Fisher and Switchboard are licensed for use only within Mozilla, I unfortunately can’t provide the text used to train the language model to you.


(Kdavis) #11

The issue of long audio files is addressed in the README:

Once everything is installed you can then use the deepspeech binary to do speech-to-text on short, approximately 5 second, audio files (currently only WAVE files with 16-bit, 16 kHz, mono are supported in the Python client)


(Kdavis) #12

@bradneuberg Could you provide example audio clips? This would help locate the source of your problem.


(Bradneuberg) #13

Okay, I’ve put together test files with results that show the issue is related to the language model somehow rather than the length of the audio or the acoustic model.

I’ve provided 10 chunked WAV files at 16 kHz with 16-bit depth, each 4 seconds long, that are a subset of my full 15-minute audio file:

The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, “CAPE”, etc.

Also in that folder are several text files with the transcription output. The first shows the output with the standard language model enabled, where you can see the garbled run-together words (chunks_with_language_model.txt):

Running inference for chunk 1
so were trying again a maybeialstart this time

Running inference for chunk 2
omiokaarforfthelastquarterwastoget

Running inference for chunk 3
to car to state deloedmarchinstrumnalha

Running inference for chunk 4
a tonproductcaseregaugesomd produce sidnelfromthat

Running inference for chunk 5
i am a to do that you know 

Running inference for chunk 6
we finish the kepehandlerrwend finished backfileprocessing 

Running inference for chunk 7
and is he teckdatthatwewould need to do to split the cape 

Running inference for chunk 8
out from sir handler and i are on new 

Running inference for chunk 9
he is not monolithic am andthanducotingswrat 

Running inference for chunk 10
relizationutenpling paws on that until it its a product signal

Then, I’ve provided similar output with the language model turned off (chunks_without_language_model.txt):

Running inference for chunk 1
so we're tryng again ah maybe alstart this time

Running inference for chunk 2
omiokaar forf the last quarter was to get

Running inference for chunk 3
oto car to state deloed march in strumn alha

Running inference for chunk 4
um ton product  caser egauges somd produc sidnel from that

Running inference for chunk 5
am ah to do that ou nowith

Running inference for chunk 6
we finishd the kepe handlerr wend finished backfile processinga

Running inference for chunk 7
on es eteckdat that we would need to do to split the kae ha

Running inference for chunk 8
rout frome sir hanler and ik ar on newh

Running inference for chunk 9
ch las not monoliic am andthan ducotings wrat 

Running inference for chunk 10
relization u en pling a pas on that until it its a product signal

I’ve included both these files in the shared Dropbox folder link above.
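Both output files were produced the same way, only toggling whether the decoder loads the language model; roughly like the sketch below. It assumes the 0.1-era deepspeech Python package API (Model / enableDecoderWithLM / stt), and the constants mirror what I believe the bundled client.py uses by default.

```python
# Sketch of producing both outputs: same acoustic model, with and
# without the LM decoder enabled.
import wave
import numpy as np
from deepspeech.model import Model

N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT = 1.75, 1.00, 1.00

def transcribe(wav_path, use_lm):
    ds = Model("output_graph.pb", N_FEATURES, N_CONTEXT,
               "alphabet.txt", BEAM_WIDTH)
    if use_lm:
        ds.enableDecoderWithLM("alphabet.txt", "lm.binary", "trie",
                               LM_WEIGHT, WORD_COUNT_WEIGHT,
                               VALID_WORD_COUNT_WEIGHT)
    w = wave.open(wav_path, "rb")
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
    rate = w.getframerate()
    w.close()
    return ds.stt(audio, rate)
```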

Here’s what the correct transcript should be, manually done (chunks_correct_manual_transcription.txt):

So, we're trying again, maybe I'll start this time.

So my OKR for the last quarter was to get AutoOCR to a state that we could
launch an external alpha, and product could sort of gauge some product signal
from that. To do that we finished the CAPE handler, we finished backfill 
processing, we have some tech debt that we would need to do to split the CAPE 
handler out from the search handler and make our own new handler so its not
monolithic, and do some things around CAPE utilization. We are kind of putting
a pause on that until we get some product signal.

This shows the language model is the source of the problem; I’ve seen anecdotal reports from this message base and from blog posts that it is widespread. Perhaps when the language model hits an unknown n-gram, it ends up joining the surrounding words together rather than retaining the spaces between them.


(Bradneuberg) #14

Okay, this is clearly a bug, so I’ve opened an issue on the official Mozilla DeepSpeech repo with clear details to reproduce it:


(Kdavis) #15

@bradneuberg Thanks for the detailed issue!