DeepSpeech generates long nonsense tokens as output

bwan03 · July 2, 2018, 1:24pm

I tested DeepSpeech’s pretrained model (deepspeech-0.1.1-models) on a 1-minute recording, a WAV file resampled to 16 kHz. The result is strange: it generates very long tokens that make no sense at all. For example:

“rairaiaprgramsthereussey”,

“proremihadthemigtybetardprogramteyeradiqhorembertireseveted”

… …

It also doesn’t seem easy to segment these long tokens into several legitimate tokens. I have tried other speech-to-text APIs and have not run into the same issue. I would appreciate it if anyone could shed light on this. Thanks!
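Since the post mentions resampling, one quick sanity check is to read the WAV header and confirm the file really has the expected format. The snippet below is a minimal sketch using Python's standard `wave` module. The target of 16 kHz, 16-bit, mono is an assumption (the linked issue only states 16 kHz and 16-bit depth), and the function names `describe_wav` and `matches_expected_format` are hypothetical helpers, not part of DeepSpeech.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, bits_per_sample) read from a WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def matches_expected_format(path, rate=16000, channels=1, bits=16):
    """Check the header against an assumed target format (16 kHz, mono, 16-bit)."""
    return describe_wav(path) == (rate, channels, bits)
```

Calling `describe_wav("recording.wav")` on the resampled file (the filename here is only a placeholder) shows at a glance whether the resampling step produced what was intended.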

lissyx · July 3, 2018, 7:32am

github.com/mozilla/DeepSpeech

Language model incorrectly drops spaces for out-of-vocabulary words

opened 11:06PM - 05 Jan 18 UTC

closed 08:10PM - 30 Mar 19 UTC

BradNeuberg

Mozilla DeepSpeech will sometimes create long runs of text with no spaces:

```
omiokaarforfthelastquarterwastoget
```

This happens even with short audio clips (4 seconds) with a native American English speaker recorded using a high quality microphone in Mac OS X laptops. I've isolated the problem to interaction with the language model rather than the acoustic model or length of audio clips, as the problem goes away when the language model is turned off. The problem might be related to encountering out-of-vocabulary terms.

I’ve put together test files with results that show the issue is related to the language model somehow rather than the length of the audio or the acoustic model. I’ve provided 10 chunked WAV files at 16 kHz, 16-bit depth, each 4 seconds long, that are a subset of a fuller 15 minute audio file (I have not provided that full 15 minute file, as a few shorter reproducible chunks are sufficient to reproduce the problem):

https://www.dropbox.com/sh/3qy65r6wo8ldtvi/AAAAVinsD_kcCi8Bs6l3zOWFa?dl=0

The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, “CAPE”, etc.

Also in that folder are several text files that show the output with the standard language model being used, showing the garbled words together (`chunks_with_language_model.txt`):

```
Running inference for chunk 1
so were trying again a maybeialstart this time
Running inference for chunk 2
omiokaarforfthelastquarterwastoget
Running inference for chunk 3
to car to state deloedmarchinstrumnalha
Running inference for chunk 4
a tonproductcaseregaugesomd produce sidnelfromthat
Running inference for chunk 5
i am a to do that you know
Running inference for chunk 6
we finish the kepehandlerrwend finished backfileprocessing
Running inference for chunk 7
and is he teckdatthatwewould need to do to split the cape
Running inference for chunk 8
out from sir handler and i are on new
Running inference for chunk 9
he is not monolithic am andthanducotingswrat
Running inference for chunk 10
relizationutenpling paws on that until it its a product signal
```

Then, I’ve provided similar output with the language model turned off (`chunks_without_language_model.txt`):

```
Running inference for chunk 1
so we're tryng again ah maybe alstart this time
Running inference for chunk 2
omiokaar forf the last quarter was to get
Running inference for chunk 3
oto car to state deloed march in strumn alha
Running inference for chunk 4
um ton product caser egauges somd produc sidnel from that
Running inference for chunk 5
am ah to do that ou nowith
Running inference for chunk 6
we finishd the kepe handlerr wend finished backfile processinga
Running inference for chunk 7
on es eteckdat that we would need to do to split the kae ha
Running inference for chunk 8
rout frome sir hanler and ik ar on newh
Running inference for chunk 9
ch las not monoliic am andthan ducotings wrat
Running inference for chunk 10
relization u en pling a pas on that until it its a product signal
```

I’ve included both these files in the shared Dropbox folder link above.

Here’s what the correct transcript should be, manually done (`chunks_correct_manual_transcription.txt`):

```
So, we're trying again, maybe I'll start this time.
So my OKR for the last quarter was to get AutoOCR to a state that we could launch an external alpha, and product could sort of gauge some product signal from that.
To do that we finished the CAPE handler, we finished backfill processing, we have some tech debt that we would need to do to split the CAPE handler out from the search handler and make our own new handler so its not monolithic, and do some things around CAPE utilization.
We are kind of putting a pause on that until we get some product signal.
```

This shows the language model is the source of this problem; I’ve seen anecdotal reports from the official message base and blog posts that this is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up combining all of them together rather than retaining the space between them.

Discussion around this bug started on the standard DeepSpeech discussion forum:

https://discourse.mozilla.org/t/text-produced-has-long-strings-of-words-with-no-spaces/24089/13
https://discourse.mozilla.org/t/longer-audio-files-with-deep-speech/22784/3

- **Have I written custom code (as opposed to running examples on an unmodified clone of the repository)**: The standard `client.py` was slightly modified to segment the longer 15 minute audio clip into 4 second blocks.
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Mac OS X 10.12.6 (16G1036)
- **TensorFlow installed from (our builds, or upstream TensorFlow)**: Both Mozilla DeepSpeech and TensorFlow were installed into a virtualenv setup via the following requirements.txt file:

```
tensorflow==1.4.0
deepspeech==0.1.0
numpy==1.13.3
scipy==0.19.1
webrtcvad==2.0.10
```

- **TensorFlow version (use command below)**:

```
('v1.4.0-rc1-11-g130a514', '1.4.0')
```

- **Python version**:

```
Python 2.7.13
```

- **Bazel version (if compiling from source)**: Did not compile from source.
- **GCC/Compiler version (if compiling from source)**: Same
- **CUDA/cuDNN version**: Used CPU only version
- **GPU model and memory**: Used CPU only version
- **Exact command to reproduce**: I haven't provided my full modified `client.py` that segments longer audio, but to run with a language model using the standard `deepspeech` command against a known 4-second audio clip included in the Dropbox folder shared above you can run the following:

```
# Set $DEEPSPEECH to where full Deep Speech checkout is; note that my own git checkout
# for the `deepspeech` runner is at git sha fef25e9ea6b0b6d96dceb610f96a40f2757e05e4
deepspeech $DEEPSPEECH/models/output_graph.pb chunk_2_length_4.0_s.wav $DEEPSPEECH/models/alphabet.txt $DEEPSPEECH/models/lm.binary $DEEPSPEECH/models/trie

# Similar command to run without language model -- spaces retained for unknown words:
deepspeech $DEEPSPEECH/models/output_graph.pb chunk_2_length_4.0_s.wav $DEEPSPEECH/models/alphabet.txt
```

This is clearly a bug and not a feature :)
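To see at a glance which chunks are affected, the two sets of transcripts can be compared by flagging unusually long tokens. The following is a rough sketch, not part of DeepSpeech: the helper `long_tokens` and the 15-character threshold are assumptions chosen for illustration, and the sample strings are copied from chunks 2, 3 and 5 of the outputs quoted in the issue.

```python
def long_tokens(transcript, max_len=15):
    """Return the whitespace-separated tokens longer than max_len characters."""
    return [token for token in transcript.split() if len(token) > max_len]


# Transcripts for chunks 2, 3 and 5, copied from the issue text.
with_lm = {
    2: "omiokaarforfthelastquarterwastoget",
    3: "to car to state deloedmarchinstrumnalha",
    5: "i am a to do that you know",
}
without_lm = {
    2: "omiokaar forf the last quarter was to get",
    3: "oto car to state deloed march in strumn alha",
    5: "am ah to do that ou nowith",
}

for chunk in sorted(with_lm):
    # Print the suspicious tokens with and without the language model.
    print(chunk, long_tokens(with_lm[chunk]), long_tokens(without_lm[chunk]))
```

With this threshold, chunks 2 and 3 are flagged only in the output produced with the language model, which matches the observation that the spaces disappear when the language model is enabled. A fixed length cutoff is crude and will miss shorter merged runs, so the threshold would need tuning for real use.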
