Text produced has long strings of words with no spaces

bradneuberg · January 4, 2018, 8:40pm

Thanks for the Mozilla DeepSpeech project! Great open source contribution.

I’m getting long strings of words with no spaces. Example:

split the cape handler out from the sir hanler and make or on new hanlerswas not moneliticamandthenduconthingswerencaerulizationotenopuing paws on that until i its a product signalmyeanwhatwassomebiokarsthatyouwerefocusedtomtheturmeting projection so at

My audio is a 16 kHz, 16-bit, mono WAV file encoded as PCM S16 LE. I’m using the default Python client to test things. The audio was recorded cleanly via a Mac OS X laptop’s microphone.

I’ve seen this with other audio samples I’ve tried. I looked through the Mozilla DeepSpeech GitHub issues and didn’t see others reporting this. Is this a known issue? Are there any known workarounds (different audio setup, etc.)?

Thanks!

Best,
Brad Neuberg

lissyx · January 4, 2018, 8:42pm

How long is your audio file?

bradneuberg · January 4, 2018, 8:45pm

~15 minutes

Looks like this is a known issue reported by others in the message base: Longer audio files with Deep Speech

Maybe this should be added somewhere in the Mozilla DeepSpeech FAQ? Might be nice to make the Python client use VAD to segment the audio as a command line option to prevent this issue.

lissyx · January 4, 2018, 9:22pm

I’m not too sure about the VAD stuff, it’s quite intrusive, and will probably add more complexity when we don’t need any. If you think it’s useful to document that in the FAQ, why not, do that :).

Basically, what we train on is a few minutes at most, so the model gets tricked by much longer sequences. This might change in the future, however.

bradneuberg · January 4, 2018, 11:11pm

Actually, I implemented both VAD and fixed-length chunking and the problem remains. I think something else is broken. I documented more details here: Longer audio files with Deep Speech
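For readers who want to try the same experiment, fixed-length chunking can be sketched with Python's standard `wave` module. This is a minimal sketch, not the code actually used in this thread: the function name `split_wav`, the output file naming, and the 4-second default are assumptions made for illustration. As reported above, chunking alone did not make the run-together words go away.

```python
import os
import wave


def split_wav(src_path, out_dir, chunk_seconds=4.0):
    """Split a PCM WAV file into consecutive fixed-length chunks.

    Hypothetical helper: writes chunk_001.wav, chunk_002.wav, ... into
    out_dir and returns the list of paths. The last chunk may be shorter.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            path = os.path.join(out_dir, "chunk_%03d.wav" % (len(paths) + 1))
            with wave.open(path, "wb") as dst:
                # Copy channel count, sample width and rate from the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(path)
    return paths
```

Each resulting file can then be passed to the client one at a time, which keeps every input close to the clip lengths the model was trained on.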

lissyx · January 5, 2018, 4:56am

It might also just be a result of training data versus real-world usage. We know that non-native speakers of American English get worse results (myself included) because of the training dataset. Hopefully results will improve once training includes a broader range of accents. Could you record clear audio clips of 5-10 seconds (microphones sometimes produce strange artifacts too) and make sure you try with and without the language model?

bradneuberg · January 5, 2018, 4:31pm

I don’t believe it’s the accents or acoustic model - if I run without the language model no words are put together, so the problem looks like it’s in the language model somehow.

bradneuberg · January 5, 2018, 4:36pm

I’d like to add some custom words to the language model to see if that helps the garbled words issue, but I can’t regenerate it since the real vocab.txt is not available. Is there any way I can privately get it to aid debugging and testing of this issue?

lissyx · January 5, 2018, 4:40pm

Can you share the output with and without the language model, please? And with short and long audio clips?

kdavis · January 5, 2018, 6:09pm

The text used to train the language model was/is a combination of texts from the Fisher, Switchboard, and other corpora. As Fisher + Switchboard are licensed to only be used within Mozilla, unfortunately, I can’t provide the text used to train the language model to you.

kdavis · January 5, 2018, 6:11pm

The issue of long audio files is addressed in the README

Once everything is installed you can then use the deepspeech binary to do speech-to-text on short, approximately 5 second, audio files (currently only WAVE files with 16-bit, 16 kHz, mono are supported in the Python client)
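Before blaming the model, it can be worth confirming that a file really matches the format quoted above. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav_format` and its parameters are assumptions for illustration and are not part of the DeepSpeech client.

```python
import wave


def check_wav_format(path, rate=16000, sample_width=2, channels=1):
    """Return a list of mismatches against the expected format.

    Hypothetical helper. An empty list means the file is 16 kHz,
    16-bit (2 bytes per sample), mono with the default arguments.
    wave.open raises wave.Error if the file is not a WAV it can read.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                "sample rate is %d Hz, expected %d" % (wav.getframerate(), rate))
        if wav.getsampwidth() != sample_width:
            problems.append(
                "sample width is %d bytes, expected %d" % (wav.getsampwidth(), sample_width))
        if wav.getnchannels() != channels:
            problems.append(
                "channel count is %d, expected %d" % (wav.getnchannels(), channels))
    return problems
```

Calling `check_wav_format("my_clip.wav")` (the file name is a placeholder) and printing the result gives a quick sanity check before running inference.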

kdavis · January 5, 2018, 6:13pm

@bradneuberg Could you provide example audio clips? This would help locate the source of your problem.

bradneuberg · January 5, 2018, 10:43pm

Okay, I’ve put together test files with results that show the issue is related to the language model somehow rather than the length of the audio or the acoustic model.

I’ve provided 10 chunked WAV files at 16 kHz and 16-bit depth, each 4 seconds long, that are a subset of my full 15-minute audio file:

The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, “CAPE”, etc.

Also in that folder are several text files that show the output with the standard language model being used, showing the words run together (chunks_with_language_model.txt):

Running inference for chunk 1
so were trying again a maybeialstart this time

Running inference for chunk 2
omiokaarforfthelastquarterwastoget

Running inference for chunk 3
to car to state deloedmarchinstrumnalha

Running inference for chunk 4
a tonproductcaseregaugesomd produce sidnelfromthat

Running inference for chunk 5
i am a to do that you know 

Running inference for chunk 6
we finish the kepehandlerrwend finished backfileprocessing 

Running inference for chunk 7
and is he teckdatthatwewould need to do to split the cape 

Running inference for chunk 8
out from sir handler and i are on new 

Running inference for chunk 9
he is not monolithic am andthanducotingswrat 

Running inference for chunk 10
relizationutenpling paws on that until it its a product signal

Then, I’ve provided similar output with the language model turned off (chunks_without_language_model.txt):

Running inference for chunk 1
so we're tryng again ah maybe alstart this time

Running inference for chunk 2
omiokaar forf the last quarter was to get

Running inference for chunk 3
oto car to state deloed march in strumn alha

Running inference for chunk 4
um ton product  caser egauges somd produc sidnel from that

Running inference for chunk 5
am ah to do that ou nowith

Running inference for chunk 6
we finishd the kepe handlerr wend finished backfile processinga

Running inference for chunk 7
on es eteckdat that we would need to do to split the kae ha

Running inference for chunk 8
rout frome sir hanler and ik ar on newh

Running inference for chunk 9
ch las not monoliic am andthan ducotings wrat 

Running inference for chunk 10
relization u en pling a pas on that until it its a product signal

I’ve included both these files in the shared Dropbox folder link above.

Here’s what the correct transcript should be, manually done (chunks_correct_manual_transcription.txt):

So, we're trying again, maybe I'll start this time.

So my OKR for the last quarter was to get AutoOCR to a state that we could
launch an external alpha, and product could sort of gauge some product signal
from that. To do that we finished the CAPE handler, we finished backfill 
processing, we have some tech debt that we would need to do to split the CAPE 
handler out from the search handler and make our own new handler so its not
monolithic, and do some things around CAPE utilization. We are kind of putting
a pause on that until we get some product signal.

This shows the language model is the source of this problem; I’ve seen anecdotal reports on this message base and in blog posts that this is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up combining the words together rather than retaining the spaces between them.
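One rough way to compare the two output files is to flag tokens that are implausibly long for English. This is only a heuristic sketch: the helper name `run_together_tokens` and the 15-character threshold are arbitrary assumptions, and nothing here is part of DeepSpeech.

```python
def run_together_tokens(transcript, max_len=15):
    """Return the whitespace-separated tokens longer than max_len characters.

    Hypothetical heuristic: very long tokens suggest that several words
    were emitted without spaces between them.
    """
    return [token for token in transcript.split() if len(token) > max_len]


# Chunk 2 from the two output files above, with and without the language model.
with_lm = "omiokaarforfthelastquarterwastoget"
without_lm = "omiokaar forf the last quarter was to get"

print(run_together_tokens(with_lm))
print(run_together_tokens(without_lm))
```

Counting flagged tokens per chunk for both files gives a simple number to track when comparing decoder settings.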

bradneuberg · January 5, 2018, 11:07pm

Okay this is clearly a bug, so I’ve opened a bug on the official Mozilla DeepSpeech repo with clear details to repro it:

kdavis · January 8, 2018, 12:39pm

@bradneuberg Thanks for the detailed issue!

b.r · March 9, 2018, 6:32pm

Hello dear developers of DS,

You have been doing a phenomenal job with each release of DeepSpeech! Thank you and way to go!!

At this moment, I am concerned about this issue of long strings without spaces…

Here is the inference output from a DS model trained on the TED training set and evaluated on the TED test set.

I used bin/run-ted.sh (from DS repo) with batch_size = 32 and trained on 4 GPUs, effectively making the minibatch size = 128.

==========================================
truth — they used the channels to pull water back in they flooded the canals
hypothesis — i use the channels to pull water back in the flotathecanose

truth — the farm 's incredible i mean you 've never seen anything like this
hypothesis — the farmsincrodiidmthanyounemberasem hing like this

truth — i was there not long ago with miguel
hypothesis — i was there not long velithmea

truth — like three parts charles darwin and one part crocodile dundee
hypothesis — i threepivethriwsdollandandonepartackatoutonme

===================================================
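To put a number on truth/hypothesis pairs like these, the usual metric is word error rate (WER): the word-level edit distance divided by the number of words in the truth. The sketch below is a plain Python implementation written for illustration; the function name `word_error_rate` is an assumption, and this is not the evaluation code from the DS repo.

```python
def word_error_rate(truth, hypothesis):
    """Word-level Levenshtein distance divided by the truth's word count."""
    ref = truth.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("truth must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate(
    "i was there not long ago with miguel",
    "i was there not long velithmea",
))
```

Because several words fused into one token count as one substitution plus deletions, run-together output inflates WER quickly even when the underlying characters are mostly right.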

I suppose the TED vocabulary words did not go into your lm.binary & trie generation and that is one reason why we are getting into trouble like this.

I understand you are working vigorously towards a solution. Any idea when a fix will be available?

Is there a workaround you can suggest in the interim?

thank you,
regards,
Buvana

yv001 · March 9, 2018, 6:44pm

There hasn’t been any activity around this issue for quite a while. Is this deferred for some reason or is it going to be resolved as a part of another feature in coming versions?

I am running into the problem with pretty much any non-trivial input.

lissyx · March 9, 2018, 6:50pm

I guess @reuben’s current priority is getting streaming to work.

bradneuberg · March 9, 2018, 6:58pm

Btw, this bug caused us (Dropbox) to stop evaluating Mozilla DeepSpeech as a potential solution to something we are working on. It seems to affect most transcriptions we have evaluated vs other solutions.

kdavis · March 9, 2018, 7:22pm

@bradneuberg Sorry to hear. Unfortunately, we’re quite a small team so tasks are queued up.

One of our developers, @reuben, is assigned to work on the problem, Issue 1156. However, he currently has all his time occupied by another large project, PR 1275, which allows the engine to do streaming speech recognition.

Once he’s finished with the PR, he’ll have a bit more time to focus on Issue 1156.