Microphone stream w/ nodejs

Hello everyone,

I saw this gist of running inference on a microphone with Python https://gist.github.com/reuben/80d64de15d1f46d34d28c7e83fc5f57e#file-ds_mic-py and I’ve been trying to get this working in Node for the past couple of days, to no avail.

I’ve tried using the feedAudioContent method like in the Python gist, but I keep getting an “illegal number of arguments” error when I pass in a read stream as the first argument and the mic buffer as the second.

I’ve also tried to get it working like in the example (https://github.com/mozilla/DeepSpeech/blob/master/native_client/javascript/client.js), and here I’ve gotten slightly farther. I’ve replaced the audio buffer with one sent from the browser recording, converted it to WAV, and sent it as an ArrayBuffer with the following settings:
    codec: {
        sampleRate: 16000,
        channels: 1,
        app: 2048,
        frameDuration: 20,
        bufferSize: 2048
    }
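
For context, the capture side boils down to the usual getUserMedia + ScriptProcessorNode pattern that produces the Float32 samples I feed into the WAV encoder; a rough sketch (not my exact code, which follows the guide’s recorder setup):

    // Rough sketch of the browser capture side: collect Float32 samples at the
    // AudioContext's native rate (typically 44100 or 48000 Hz).
    var audioContext = new AudioContext();
    var recordedSamples = [];

    navigator.mediaDevices.getUserMedia({ audio: true }).then(function (stream) {
        var source = audioContext.createMediaStreamSource(stream);
        // bufferSize 2048, 1 input channel, 1 output channel
        var processor = audioContext.createScriptProcessor(2048, 1, 1);

        processor.onaudioprocess = function (event) {
            // Float32 samples in [-1, 1]
            recordedSamples.push(new Float32Array(event.inputBuffer.getChannelData(0)));
        };

        source.connect(processor);
        processor.connect(audioContext.destination);
    });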

But whenever I do this, all I get is either a blank inference result or, occasionally, just the letter “h”.

I haven’t modified the original script much: instead of getting a buffer from a WAV file using fs, I just have the browser send a buffer.

Any help on this would be incredibly appreciated.

Can you make sure it’s being properly converted to WAV, 16-bit PCM at 16 kHz? Can you dump the WAV file and perform some offline inference as a reference comparison? Or, even nicer, share it?
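
If the WAV only exists in memory on the Node side, dumping it is a one-liner; assuming wavView is the DataView coming out of your encoder (hypothetical name):

    // Write the in-memory WAV to disk so it can be checked offline
    // (play it back, inspect it, run the stock client.js on it).
    const fs = require('fs');
    fs.writeFileSync('dump.wav', Buffer.from(wavView.buffer));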

Thanks for the quick reply.

Here’s what I’ve been using to convert to WAV:

    function encodeWAV(samples, sampleRate) { // passing in 16000 as sampleRate
        var buffer = new ArrayBuffer(44 + samples.length * 2);
        var view = new DataView(buffer);
        // RIFF container header
        writeString(view, 0, 'RIFF');
        view.setUint32(4, 36 + samples.length * 2, true); // chunk size = file size minus 8
        writeString(view, 8, 'WAVE');
        // "fmt " sub-chunk
        writeString(view, 12, 'fmt ');
        view.setUint32(16, 16, true);             // fmt chunk size (16 for PCM)
        view.setUint16(20, 1, true);              // audio format: 1 = PCM
        view.setUint16(22, 1, true);              // channels: mono
        view.setUint32(24, sampleRate, true);     // sample rate
        view.setUint32(28, sampleRate * 2, true); // byte rate = sampleRate * channels * 2
        view.setUint16(32, 2, true);              // block align = channels * 2
        view.setUint16(34, 16, true);             // bits per sample
        // "data" sub-chunk
        writeString(view, 36, 'data');
        view.setUint32(40, samples.length * 2, true); // data size in bytes
        floatTo16BitPCM(view, 44, samples);       // helper from the guide (see below)
        return view;
    }

Mostly, just following this guide: https://aws.amazon.com/blogs/machine-learning/capturing-voice-input-in-a-browser/ with some tweaks.
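
For completeness, the writeString and floatTo16BitPCM helpers it relies on are the standard recorder.js-style ones from that guide, roughly:

    // Write an ASCII string byte-by-byte into the DataView
    function writeString(view, offset, string) {
        for (var i = 0; i < string.length; i++) {
            view.setUint8(offset + i, string.charCodeAt(i));
        }
    }

    // Clamp each Float32 sample to [-1, 1] and scale it to signed 16-bit PCM
    function floatTo16BitPCM(output, offset, input) {
        for (var i = 0; i < input.length; i++, offset += 2) {
            var s = Math.max(-1, Math.min(1, input[i]));
            output.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
        }
    }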

I’ll get an audio file exported and test with that ASAP, and then I’ll share that file here as well. That’s an excellent idea!

On that note, however, is there not a way to do this without exporting it as a WAV file?
The first thing the Node script does is take the WAV and get a buffer from it, so is there a way I can just skip that part and send a buffer?

Yes, look at https://github.com/mozilla/DeepSpeech/blob/master/native_client/javascript/client.js#L75-L80 where the first thing that is done is calling sox to produce a buffer. So in your case, you could maybe just skip directly to https://github.com/mozilla/DeepSpeech/blob/master/native_client/javascript/client.js#L103
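
In other words, something roughly like this; it follows the 0.x-era client.js, so the Model constructor arguments and the stt() signature may differ in the version you have installed, and the file paths are placeholders:

    // Sketch: run inference directly on a raw 16 kHz / mono / 16-bit PCM buffer,
    // skipping the sox step entirely. Constants mirror the 0.x-era client.js.
    const fs = require('fs');
    const Ds = require('deepspeech');

    const N_FEATURES = 26;
    const N_CONTEXT = 9;
    const BEAM_WIDTH = 500;

    const model = new Ds.Model('output_graph.pbmm', N_FEATURES, N_CONTEXT, 'alphabet.txt', BEAM_WIDTH);

    // Stand-in for the buffer your browser sends: read the dumped file and strip
    // its 44-byte WAV header to get the raw PCM samples.
    const audioBuffer = fs.readFileSync('dump.wav').slice(44);

    // Half the byte length, because the bindings expect a count of 16-bit samples,
    // not bytes (same trick the stock client.js uses).
    console.log(model.stt(audioBuffer.slice(0, audioBuffer.length / 2), 16000));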

Thanks again for your suggestion.
I think I have a lead. I downloaded the WAV file I was generating and it sounds really sped up, so I think that’s what I need to look into next. I’m also going to try another method of resampling the microphone audio.
Here’s an example file: https://1drv.ms/u/s!AgwGkh2YF_uxiBEYQ27OayztLAaD

Will let you know if I figure it out.

I’m not sure I get what you meant to say here :/. Are you saying it’s faster? Better?

Like the example WAV file I posted of me saying “testing” (https://1drv.ms/u/s!AgwGkh2YF_uxiBEYQ27OayztLAaD). For lack of a better word, the voice sounds like a chipmunk. I’m guessing I’m doing something wrong when resampling the mic audio to 16 kHz.
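
(For anyone else who hits this: a chipmunk-sounding dump usually means the WAV header claims 16 kHz while the samples are still at the AudioContext’s native 44.1/48 kHz, i.e. the resample step was effectively skipped. The usual crude fix in recorder.js-derived code is an averaging downsampler applied to the Float32 samples before encoding, roughly like the sketch below; it is only valid when going down in rate.)

    // Crude averaging downsampler: run the captured Float32 samples through this
    // BEFORE encodeWAV, and write targetRate (16000) into the WAV header.
    function downsampleBuffer(buffer, inputRate, targetRate) {
        if (targetRate === inputRate) {
            return buffer;
        }
        var ratio = inputRate / targetRate;
        var newLength = Math.round(buffer.length / ratio);
        var result = new Float32Array(newLength);
        var offsetResult = 0;
        var offsetBuffer = 0;
        while (offsetResult < result.length) {
            var nextOffsetBuffer = Math.round((offsetResult + 1) * ratio);
            // Average every source sample that falls into this output slot
            var accum = 0, count = 0;
            for (var i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
                accum += buffer[i];
                count++;
            }
            result[offsetResult] = count > 0 ? accum / count : 0;
            offsetResult++;
            offsetBuffer = nextOffsetBuffer;
        }
        return result;
    }

    // e.g. var samples16k = downsampleBuffer(samples, audioContext.sampleRate, 16000);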

You should record directly at 16 kHz. I’m not sure what you’re doing, but resampling might add artifacts that would require other filtering.

I did finally manage to get it working, but you’re right: I think to get maximum accuracy I would have to record at 16 kHz, which unfortunately you can’t do in the browser.
For example, I recorded this speech: https://1drv.ms/u/s!AgwGkh2YF_uxiBIQxFn6XDmG6a6B
which is a converted 16 kHz, 1-channel WAV file with the words “he had a scar across his face”, and it got inferred as “he had his carecross the space”. On a positive note, it did get simple sentences like “this is a test” completely right.

I did also try the different output_graph files: I started with the rounded .pbmm, then the plain .pb, and finally tried with the lm.binary file as well. While the results were different, I didn’t see an improvement.

Thanks again for all your help. This was a fun experiment. I’ll have to stick with the WebSpeech API for now.
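
One note on the files you tried: lm.binary is not another acoustic graph, it is the KenLM language model that gets enabled on top of the .pb/.pbmm. A rough sketch, using the 0.x-era defaults (the exact enableDecoderWithLM argument list has changed between releases, so check the bindings you have installed; paths are placeholders):

    // The language model sits on top of the acoustic model; it does not replace it.
    // Weights below are the 0.x-era client.js defaults.
    const LM_ALPHA = 0.75; // language model weight
    const LM_BETA = 1.85;  // word insertion weight

    const model = new Ds.Model('output_graph.pbmm', N_FEATURES, N_CONTEXT, 'alphabet.txt', BEAM_WIDTH);
    model.enableDecoderWithLM('alphabet.txt', 'lm.binary', 'trie', LM_ALPHA, LM_BETA);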

We got decent results when testing the patches for the WebSpeech API. The architecture is: record as Opus in Firefox, send to a speech-proxy in Node.js that converts the Opus to 16 kHz WAV, and then do the inference using DeepSpeech.

A discrepancy like the one you describe could come from conversion artifacts, or it could just be a consequence of the model not yet being trained enough on non-native English speakers.

record as Opus in Firefox, send to a speech-proxy in Node.js that converts the Opus to 16 kHz WAV

Would you be able to direct me to where I can learn more about this? I managed to get a recording in Opus; however, I think I’m messing up the buffer somewhere, since SoX doesn’t seem to be able to pick it up.

Would this help? https://github.com/mozilla/speech-proxy
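
If SoX keeps refusing the Opus input, another option (not necessarily what speech-proxy does) is to shell out to ffmpeg, which decodes Opus out of the box; a rough sketch:

    // Sketch: convert an Opus recording to 16 kHz, mono, 16-bit WAV via ffmpeg.
    const { execFile } = require('child_process');

    function opusToWav16k(inputPath, outputPath, callback) {
        execFile('ffmpeg', [
            '-y',                    // overwrite the output if it exists
            '-i', inputPath,         // input Opus file
            '-ar', '16000',          // resample to 16 kHz
            '-ac', '1',              // downmix to mono
            '-acodec', 'pcm_s16le',  // 16-bit signed little-endian PCM
            outputPath
        ], callback);
    }

    // opusToWav16k('recording.opus', 'recording.wav', function (err) { /* run inference */ });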
