Okay, I’ve put together test files with results showing that the issue is related to the language model rather than to the length of the audio or to the acoustic model.
I’ve provided 10 chunked WAV files (16 kHz, 16-bit), each 4 seconds long, that are a subset of my full 15-minute audio file:
The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, and “CAPE”.
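For reference, here's a sketch of how 4-second chunks like these can be produced, using only the standard-library `wave` module. The filenames (`chunk_N.wav`) and the `split_wav` helper are illustrative, not the actual names in the shared folder:

```python
import wave

CHUNK_SECONDS = 4

def split_wav(path, chunk_seconds=CHUNK_SECONDS, limit=10):
    """Split a WAV file into fixed-length chunks, writing chunk_1.wav, chunk_2.wav, ..."""
    with wave.open(path, "rb") as src:
        params = src.getparams()  # expected: 16 kHz sample rate, 16-bit mono
        frames_per_chunk = params.framerate * chunk_seconds
        names = []
        for i in range(limit):
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"chunk_{i + 1}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)  # the wave module patches nframes on close
                dst.writeframes(frames)
            names.append(name)
    return names
```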
Also in that folder are several text files showing the output. The first, with the standard language model enabled, shows the garbled words run together (chunks_with_language_model.txt):
Running inference for chunk 1
so were trying again a maybeialstart this time
Running inference for chunk 2
omiokaarforfthelastquarterwastoget
Running inference for chunk 3
to car to state deloedmarchinstrumnalha
Running inference for chunk 4
a tonproductcaseregaugesomd produce sidnelfromthat
Running inference for chunk 5
i am a to do that you know
Running inference for chunk 6
we finish the kepehandlerrwend finished backfileprocessing
Running inference for chunk 7
and is he teckdatthatwewould need to do to split the cape
Running inference for chunk 8
out from sir handler and i are on new
Running inference for chunk 9
he is not monolithic am andthanducotingswrat
Running inference for chunk 10
relizationutenpling paws on that until it its a product signal
Then, I’ve provided similar output with the language model turned off (chunks_without_language_model.txt):
Running inference for chunk 1
so we're tryng again ah maybe alstart this time
Running inference for chunk 2
omiokaar forf the last quarter was to get
Running inference for chunk 3
oto car to state deloed march in strumn alha
Running inference for chunk 4
um ton product caser egauges somd produc sidnel from that
Running inference for chunk 5
am ah to do that ou nowith
Running inference for chunk 6
we finishd the kepe handlerr wend finished backfile processinga
Running inference for chunk 7
on es eteckdat that we would need to do to split the kae ha
Running inference for chunk 8
rout frome sir hanler and ik ar on newh
Running inference for chunk 9
ch las not monoliic am andthan ducotings wrat
Running inference for chunk 10
relization u en pling a pas on that until it its a product signal
I’ve included both of these files in the shared Dropbox folder linked above.
Here’s the correct transcript, done manually (chunks_correct_manual_transcription.txt):
So, we're trying again, maybe I'll start this time.
So my OKR for the last quarter was to get AutoOCR to a state that we could
launch an external alpha, and product could sort of gauge some product signal
from that. To do that we finished the CAPE handler, we finished backfill
processing, we have some tech debt that we would need to do to split the CAPE
handler out from the search handler and make our own new handler so it's not
monolithic, and do some things around CAPE utilization. We are kind of putting
a pause on that until we get some product signal.
This points to the language model as the source of the problem. I’ve seen anecdotal reports on this message board and in blog posts suggesting it’s a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up combining the words rather than retaining the spaces between them.
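To illustrate the hypothesis with a toy sketch (this is not the actual decoder or scorer code, and the vocabulary and penalty values are made up): if a word-level LM charges a flat penalty per out-of-vocabulary word, a beam-search decoder has an incentive to glue a run of unknown words into one token, since that incurs a single penalty instead of several:

```python
# Toy word-level LM: known words get a unigram log-probability,
# each unknown word costs a flat OOV penalty. All values are
# invented for illustration.
VOCAB = {"the": -1.2, "last": -2.9, "quarter": -3.4,
         "was": -2.1, "to": -1.5, "get": -2.6}
OOV_PENALTY = -10.0

def lm_score(text):
    """Sum of per-word log-scores; unknown words each pay the OOV penalty."""
    return sum(VOCAB.get(w, OOV_PENALTY) for w in text.split())

# Three adjacent OOV words, kept separate vs. glued into one token:
spaced = lm_score("omi okaar forf the last quarter was to get")  # 3 OOV penalties
merged = lm_score("omiokaarforf the last quarter was to get")    # 1 OOV penalty
# merged > spaced, so a decoder weighing this score prefers the run-together form
```

If something like this is happening in the real scorer, it would explain why the spaces vanish exactly around the out-of-vocabulary terms.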