Is model.intermediateDecode() good practice or not?

If I have a finished recording at hand and want to get the transcription in JavaScript, I simply do:
result = model.stt(buffer.slice(0, buffer.length / 2))

If I use the streaming Interface it looks like this:

let stream = model.createStream()
model.feedAudioContent(stream, buffer.slice(0, buffer.length / 2));
let result = model.intermediateDecode(stream);

The second and third lines repeat for as long as chunks of audio data come in from another streaming interface, like a WebSocket connection. When the stream is done, I get the final transcription with model.finishStream(stream).
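Concretely, the loop I mean looks roughly like this; a minimal sketch that assumes the chunks arrive as 16-bit PCM Buffers over a hypothetical WebSocket object ws (the ws object and its events are just my assumption, only the model calls are the API shown above):

let stream = model.createStream();

ws.on('message', (chunk) => {
  // Feed each incoming chunk; dividing the length by 2 converts bytes to 16-bit samples (see below).
  model.feedAudioContent(stream, chunk.slice(0, chunk.length / 2));
  // Show the user what has been recognized so far.
  console.log('partial:', model.intermediateDecode(stream));
});

ws.on('close', () => {
  // When the connection ends, get the final transcription.
  console.log('final:', model.finishStream(stream));
});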

The docs for intermediateDecode note that this is still a very expensive task, but in this completed and merged issue it’s mentioned that the decoder is already capable of streaming, while the docs say it isn’t.

What’s right, and is it good practice to use intermediateDecode to show the user what’s currently being recognized?

I don’t know why the “latest” link on readthedocs is outdated. The v0.6.1 docs have that bit removed as you noticed. It is not expensive: https://deepspeech.readthedocs.io/en/v0.6.1/NodeJS-API.html#Model.intermediateDecode

Didn’t notice the different versions of the docs - thanks!

While we are at it, I can’t find any information on why all of the user-contributed examples only use half the buffer for transcription, yet it still works perfectly fine. It doesn’t make sense to me to only take the first half.

For example:
model.stt(audioBuffer.slice(0, audioBuffer.length / 2))

Maybe you could add an explanation of this to a future version of the docs?

The NodeJS Buffer API does not allow you to specify the size of the individual elements; it’s always a single byte. But the DeepSpeech code expects input with 16 bits per sample, and treats it as such. Because of this, you have to divide the length by 2 to get from the number of bytes to the number of 16-bit samples.
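For example, a minimal sketch just to illustrate the byte vs. sample counting (the numbers are arbitrary):

// Four bytes of raw audio data...
const buf = Buffer.from([64, 89, 12, 187]);
console.log(buf.length);       // 4 -> number of bytes
// ...but interpreted as 16-bit samples it is only two values.
const samples = new Int16Array(buf.buffer, buf.byteOffset, buf.length / 2);
console.log(samples.length);   // 2 -> number of 16-bit samples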

We should probably make the API take a Buffer object directly and hide that length adjustment in the language binding code.

Or even better, take a Uint16Array instead, if that’s possible.

Actually, looking closely at it, this is even more sketchy than I was thinking. We’re counting on the fact that .slice returns an object that shares the underlying memory from the original object. We should definitely improve this.

I understand that this basically means that a Buffer like [64,89,12,187] would contain 2, not 4, samples, and that buffer.length / 2 would be the right way to get the number of samples, but wouldn’t using only the first half of the entire buffer for inference cut away the second half of important data?

The example above would now only contain [64,89] and completely lose the second half.

What am I missing here?

On your note about hiding this, I’d love to see the same for other things as well, like hiding those “magic numbers” LM_ALPHA, LM_BETA and BEAM_WIDTH. Maybe giving them default values and making them optionally configurable when creating the model would be a good idea?
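For reference, this is roughly how I’m passing those constants with the v0.6 NodeJS API right now (the paths and the concrete values are just placeholders from my setup):

const Ds = require('deepspeech');

const BEAM_WIDTH = 500;
const LM_ALPHA = 0.75;
const LM_BETA = 1.85;

// Beam width goes into the constructor, the LM weights into enableDecoderWithLM.
const model = new Ds.Model('output_graph.pbmm', BEAM_WIDTH);
model.enableDecoderWithLM('lm.binary', 'trie', LM_ALPHA, LM_BETA);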

You’re missing the implementation detail that .slice does not create a new underlying array, it just returns an object with adjusted start and end pointers. So we’re reading the entire buffer, even though it looks like half of it is being thrown away.
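You can see the sharing directly in Node (a quick sketch):

const buf = Buffer.from([64, 89, 12, 187]);
const half = buf.slice(0, buf.length / 2);   // a view over the first 2 bytes
half[0] = 0;           // writing through the slice...
console.log(buf[0]);   // 0 -> ...also changes the original buffer
// Both objects point at the same underlying memory, so the native code,
// which only gets a pointer and a sample count, reads past the end of the
// slice into the rest of the original buffer.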

Yes, we’re making all of those magic numbers optional soon. There’s already a PR improving the language model API, and I’m currently working on doing the same for the beam width.

Makes sense!

And this is because, when DeepSpeech gets that sliced array with modified pointers, it believes for [64,89] that there are 2 entries. And because DeepSpeech treats each entry as a 16-bit (2-byte) sample, it reads 2 x 2 bytes starting at the begin pointer of the sliced array and continues reading “over the end” of that slice, where the rest of the original buffer still exists in memory?
As you mentioned, that sounds quite sketchy xD

Sounds great!

Exactly. I’ll use the opportunity of all the API changes to fix this as well.

Fix in https://github.com/mozilla/DeepSpeech/pull/2693 FWIW.

That’s a quick turnaround from identifying a problem to fixing it right away :wink: