Text produced has long strings of words with no spaces

I’m not too sure about the VAD approach; it’s quite intrusive and will probably add complexity we don’t need. If you think it’s useful to document that in the FAQ, why not, go ahead :).

Basically, the deal is that what we train on is a few minutes at most, so the model gets tripped up by much longer sequences. This might change in the future, however.

Actually, I implemented both VAD and fixed-length chunking and the problem remains. I think something else is broken. I documented more details here: Longer audio files with Deep Speech
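
For reference, the VAD pass I tried was roughly along the lines of the sketch below, using the webrtcvad package; the frame size, aggressiveness, and file name are my own choices, not anything DeepSpeech prescribes.

```python
# Mark 30 ms frames as speech / non-speech and split on silence.
# Requires `pip install webrtcvad`; input must be 16-bit mono PCM.
import wave

import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                     # webrtcvad accepts 10, 20 or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(2)                            # aggressiveness 0 (least) to 3 (most)

with wave.open("recording.wav", "rb") as wav_in:  # 16 kHz, 16-bit, mono
    pcm = wav_in.readframes(wav_in.getnframes())

segments, current = [], bytearray()
for offset in range(0, len(pcm) - FRAME_BYTES, FRAME_BYTES):
    frame = pcm[offset:offset + FRAME_BYTES]
    if vad.is_speech(frame, SAMPLE_RATE):
        current.extend(frame)
    elif current:                                 # a silence frame ends the segment
        segments.append(bytes(current))
        current = bytearray()
if current:
    segments.append(bytes(current))

print("found %d speech segments" % len(segments))
```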

It might also just be the result of training vs. real-world usage. We know that non-native American English speakers (myself included) get worse results because of the training dataset. Hopefully it will be better once training includes a broader range of accents. If you can, record clear audio clips of 5–10 seconds (sometimes microphones produce strange artifacts too), and make sure you try with and without the language model :slight_smile:

I don’t believe it’s the accents or the acoustic model: if I run without the language model, no words are run together, so the problem looks like it’s somewhere in the language model.

I’d like to add some custom words to the language model to see if that helps the garbled words issue, but I can’t regenerate it since the real vocab.txt is not available. Is there any way I can privately get it to aid debugging and testing of this issue?
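
For context, the regeneration I have in mind is roughly the standard KenLM pipeline sketched below; the corpus file names are hypothetical, and the missing piece is exactly the original training text (a matching trie would also have to be regenerated, which I’ve left out here).

```python
# Sketch of rebuilding an n-gram LM with extra terms folded in, assuming
# KenLM's lmplz and build_binary tools are on PATH. File names are
# hypothetical; the original LM training text is the part I don't have.
import subprocess

# 1. Combine the original LM text with the custom/technical terms.
with open("combined_corpus.txt", "w") as out:
    for path in ("original_lm_text.txt", "extra_terms.txt"):
        with open(path) as src:
            out.write(src.read())

# 2. Train a 5-gram ARPA model, then compact it into KenLM's binary format.
with open("combined_corpus.txt") as text, open("lm.arpa", "w") as arpa:
    subprocess.check_call(["lmplz", "-o", "5"], stdin=text, stdout=arpa)
subprocess.check_call(["build_binary", "lm.arpa", "lm.binary"])
```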

Can you share the output with and without the language model, please? And with short and long audio clips?

The text used to train the language model was/is a combination of texts from the Fisher, Switchboard, and other corpora. As Fisher + Switchboard are licensed to only be used within Mozilla, unfortunately, I can’t provide the text used to train the language model to you.

The issue of long audio files is addressed in the README:

Once everything is installed you can then use the deepspeech binary to do speech-to-text on short, approximately 5 second, audio files (currently only WAVE files with 16-bit, 16 kHz, mono are supported in the Python client)
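
If you want to rule out format problems first, a quick sanity check with the standard-library wave module (just a sketch, with a placeholder file name) confirms a clip meets those constraints:

```python
# Verify a clip is 16 kHz, 16-bit, mono WAVE, which is what the Python
# client expects; nothing here is DeepSpeech-specific.
import wave

with wave.open("my_audio.wav", "rb") as wav_in:
    assert wav_in.getframerate() == 16000, "expected a 16 kHz sample rate"
    assert wav_in.getsampwidth() == 2, "expected 16-bit samples"
    assert wav_in.getnchannels() == 1, "expected mono audio"
    duration = wav_in.getnframes() / wav_in.getframerate()
    print("duration: %.1f s" % duration)
```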

@bradneuberg Could you provide example audio clips? This would help locate the source of your problem.

Okay, I’ve put together test files with results that show the issue is related to the language model somehow rather than the length of the audio or the acoustic model.
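
For reference, the with-LM and without-LM runs were produced roughly as in the sketch below, based on the 0.1.x Python client API; the argument lists changed in later releases, and the weight values are just what I ran with, so check the client.py that ships with your version.

```python
# Run inference on one chunk, with or without the language model.
import scipy.io.wavfile as wav
from deepspeech.model import Model

N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT = 1.75, 1.00, 1.00

ds = Model("output_graph.pb", N_FEATURES, N_CONTEXT, "alphabet.txt", BEAM_WIDTH)

# Comment out this call to decode with the acoustic model alone (no LM).
ds.enableDecoderWithLM("alphabet.txt", "lm.binary", "trie",
                       LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT)

fs, audio = wav.read("chunk_01.wav")   # 16 kHz, 16-bit, mono
print(ds.stt(audio, fs))
```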

I’ve provided 10 chunked WAV files at 16 kHz, 16-bit, each 4 seconds long, that are a subset of my full 15-minute audio file:
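
The chunking itself was nothing fancy, roughly the following sketch using the standard-library wave module (file names are illustrative):

```python
# Split a long 16 kHz, 16-bit, mono recording into fixed 4-second WAV chunks.
import wave

CHUNK_SECONDS = 4

with wave.open("full_recording.wav", "rb") as src:
    params = src.getparams()
    frames_per_chunk = src.getframerate() * CHUNK_SECONDS
    index = 1
    while True:
        frames = src.readframes(frames_per_chunk)
        if not frames:
            break
        with wave.open("chunk_%02d.wav" % index, "wb") as dst:
            dst.setparams(params)          # header is fixed up on close
            dst.writeframes(frames)
        index += 1
```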

The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, “CAPE”, etc.

Also in that folder are several text files showing the results. With the standard language model in use, you can see the garbled, run-together words (chunks_with_language_model.txt):

Running inference for chunk 1
so were trying again a maybeialstart this time

Running inference for chunk 2
omiokaarforfthelastquarterwastoget

Running inference for chunk 3
to car to state deloedmarchinstrumnalha

Running inference for chunk 4
a tonproductcaseregaugesomd produce sidnelfromthat

Running inference for chunk 5
i am a to do that you know 

Running inference for chunk 6
we finish the kepehandlerrwend finished backfileprocessing 

Running inference for chunk 7
and is he teckdatthatwewould need to do to split the cape 

Running inference for chunk 8
out from sir handler and i are on new 

Running inference for chunk 9
he is not monolithic am andthanducotingswrat 

Running inference for chunk 10
relizationutenpling paws on that until it its a product signal

Then, I’ve provided similar output with the language model turned off (chunks_without_language_model.txt):

Running inference for chunk 1
so we're tryng again ah maybe alstart this time

Running inference for chunk 2
omiokaar forf the last quarter was to get

Running inference for chunk 3
oto car to state deloed march in strumn alha

Running inference for chunk 4
um ton product  caser egauges somd produc sidnel from that

Running inference for chunk 5
am ah to do that ou nowith

Running inference for chunk 6
we finishd the kepe handlerr wend finished backfile processinga

Running inference for chunk 7
on es eteckdat that we would need to do to split the kae ha

Running inference for chunk 8
rout frome sir hanler and ik ar on newh

Running inference for chunk 9
ch las not monoliic am andthan ducotings wrat 

Running inference for chunk 10
relization u en pling a pas on that until it its a product signal

I’ve included both these files in the shared Dropbox folder link above.

Here’s what the correct transcript should be, manually done (chunks_correct_manual_transcription.txt):

So, we're trying again, maybe I'll start this time.

So my OKR for the last quarter was to get AutoOCR to a state that we could
launch an external alpha, and product could sort of gauge some product signal
from that. To do that we finished the CAPE handler, we finished backfill 
processing, we have some tech debt that we would need to do to split the CAPE 
handler out from the search handler and make our own new handler so its not
monolithic, and do some things around CAPE utilization. We are kind of putting
a pause on that until we get some product signal.

This shows the language model is the source of this problem; I’ve seen anecdotal reports on this message board and in blog posts that it is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up merging the surrounding words rather than retaining the spaces between them.
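
One way to poke at that hypothesis, assuming the kenlm Python bindings and a local copy of the released lm.binary, would be to compare the language model’s own score for a properly spaced candidate against the run-together output; this is only a sketch that probes the n-gram model itself, not the way the CTC decoder actually integrates it.

```python
# Compare LM log10 probabilities for a spaced vs. run-together candidate.
# Assumes `pip install kenlm` and a local lm.binary.
import kenlm

lm = kenlm.Model("lm.binary")

candidates = [
    "so my okr for the last quarter was to get",   # spaced, with the OOV term "okr"
    "omiokaarforfthelastquarterwastoget",          # run-together decoder output
]
for sentence in candidates:
    print("%10.2f  %s" % (lm.score(sentence, bos=True, eos=True), sentence))
```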

Okay, this is clearly a bug, so I’ve opened an issue on the official Mozilla DeepSpeech repo with clear details to reproduce it:

@bradneuberg Thanks for the detailed issue!

Hello dear developers of DS,

You have been doing a phenomenal job with each release of DeepSpeech! Thank you and way to go!!

At this moment, I am concerned about this issue of long strings without spaces…

Here is inference output from a DS model trained on the TED training set and evaluated on the TED test set.

I used bin/run-ted.sh (from the DS repo) with batch_size = 32 and trained on 4 GPUs, making the effective minibatch size 128.

==========================================
truth — they used the channels to pull water back in they flooded the canals
hypothesis — i use the channels to pull water back in the flotathecanose

truth — the farm 's incredible i mean you 've never seen anything like this
hypothesis — the farmsincrodiidmthanyounemberasem hing like this

truth — i was there not long ago with miguel
hypothesis — i was there not long velithmea

truth — like three parts charles darwin and one part crocodile dundee
hypothesis — i threepivethriwsdollandandonepartackatoutonme

===================================================

I suppose the TED vocabulary did not go into your lm.binary and trie generation, and that is one reason why we are getting into trouble like this.

I understand you are working vigorously towards a solution. Any idea when a fix will be available?

Is there a workaround you can suggest in the interim?

thank you,
regards,
Buvana

There hasn’t been any activity around this issue for quite a while. Is it deferred for some reason, or is it going to be resolved as part of another feature in coming versions?

I am running into the problem with pretty much any non-trivial input.

I guess @reuben’s current priority is getting streaming to work :slight_smile:

Btw, this bug caused us (Dropbox) to stop evaluating Mozilla DeepSpeech as a potential solution for something we are working on. It seems to affect most of the transcriptions we have evaluated, compared with other solutions.

@bradneuberg Sorry to hear that. Unfortunately, we’re quite a small team, so tasks are queued up.

One of our developers, @reuben, is assigned to work on the problem, Issue 1156. However, his time is currently fully occupied by another large project, PR 1275, which allows the engine to do streaming speech recognition.

Once he’s finished with the PR, he’ll have a bit more time to focus on Issue 1156.

Hi, how can I train it without a language model?

Pure training never uses a language model.

It is only when is_display_step is active that the language model is used, to create a WER report indicating training progress.

Hi @bradneuberg, I too am working on a project that uses speech-to-text as one of the initial stages of the product architecture, and unfortunately this issue of outputting illegible words is driving us further away from Mozilla’s solution. I’m searching for other implementations that can get the job done, but I haven’t been able to find any other implementation that is this comprehensive and detailed. Would you be kind enough to suggest some of the other open-source solutions you evaluated that seem to mitigate this problem?