DeepSpeech sox FAIL formats error when transcribing .wav

I am trying to set up a transcription server in Node.js, following this example: https://github.com/mozilla/DeepSpeech-examples/blob/r0.6/nodejs_wav/index.js
My client takes audio from a MediaRecorder and sends it as an ArrayBuffer to the server through Socket.IO:

sendAudioToTranscriptionServer() {
    let blob = new Blob(recordedChunks, { 'type': 'audio/wav' });
    blob.arrayBuffer().then(ab => {
        ioClient.emit("transcribable-audio", ab);
    });
}

The server then tries to transcribe the audio:

socket.on("transcribable-audio", (arrayBuffer) => {
	//let buffer = Buffer.from(new Uint8Array(arrayBuffer));
    	let buffer = new Uint8Array(arrayBuffer);
	let audioStream = new MemoryStream();
	try {
    	bufferToStream(buffer).
    	pipe(Sox({
	    global: {
	 	'no-dither': true,
	    },
	    output: {
	    	bits: 16,
	    	rate: desiredSampleRate,
	    	channels: 1,
	    	encoding: 'signed-integer',
	    	endian: 'little',
	    	compression: 0.0,
	    	type: 'raw'
	    }
    	})).
        pipe(audioStream);
	    console.log('pipe created');
    	audioStream.on('finish', () => {
	        let audioBuffer = audioStream.toBuffer();
	        const audioLength = (audioBuffer.length / 2) * (1 / desiredSampleRate);
	        console.log('audio length', audioLength);
	        let result = model.stt(audioBuffer.slice(0, audioBuffer.length / 2));
	        console.log('result:', result);
	        socket.emit("transcription-received", result);
    	});
    } catch (ex) {
	    console.error(ex.message);
	}
});

But I keep getting an “unhandled stream error in pipe”:
Error: sox FAIL formats

What could be the reason for this? How can I get more information on the error?

I don’t see any input description in your sox call?

Here is my code doing something similar:

        var audioStream = new MemoryStream();
        bufferToStream(rawAudio).
          pipe(Sox({
            input: {
              bits: 32,
              rate: 44100,
              channels: 1,
              encoding: 'floating-point',
              endian: 'little',
              type: 'raw',
            },
            output: {
              bits: 16,
              rate: sampleRate,
              channels: 1,
              encoding: 'signed-integer',
              endian: 'little',
              compression: 0.0,
              type: 'wavpcm',
            }
          })).
          pipe(audioStream);

Input type may vary, but your output also looks a bit strange; type: 'raw', for instance.

Oh, sorry, I just found out I was recording video instead of audio with my MediaRecorder, so it couldn’t be converted to wav.
I also found out that MediaRecorder doesn’t support the audio/wav MIME type, only audio/webm. Can DeepSpeech process it?
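For what it’s worth, the supported types can be checked from the browser console with MediaRecorder.isTypeSupported; a quick sketch (the exact results depend on the browser):

// Which container/codec combos this browser's MediaRecorder can actually produce.
['audio/wav', 'audio/webm', 'audio/webm;codecs=opus', 'audio/ogg;codecs=opus']
    .forEach(type => console.log(type, MediaRecorder.isTypeSupported(type)));
// Typically 'audio/wav' is false and 'audio/webm;codecs=opus' is true, so the
// Blob above contains webm data even if it is labelled 'audio/wav'.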

No, but you can still send audio/webm and have sox process it, I guess? I use lower-level audio access and just get raw FP from the browser, so it’s not very different …

I still cannot get rid of this error (sox FAIL formats) no matter what I do.
I transformed the recorded blobs from the MediaRecorder to wav and sent them to the server.
I added an input description to my sox call:

bufferToStream(buffer).
pipe(Sox({
    global: {
        'no-dither': true,
    },
    input: {
        bits: 32,
        rate: 44100,
        channels: 1,
        encoding: 'floating-point',
        endian: 'little',
        type: 'wav',
    },
    output: {
        bits: 16,
        rate: desiredSampleRate,
        channels: 1,
        encoding: 'signed-integer',
        endian: 'little',
        compression: 0.0,
        type: 'wavpcm'
    }
})).
pipe(audioStream);

How can I check what kind of formatting discrepancy I have?

Without more information from sox it’s hard …

Should that be raw instead of wav?

Have you tried dumping the audio feed before passing it to sox, and calling sox manually from the command line?

Are you sure sox is running properly?
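For example, something along these lines (the file path is just a placeholder) takes sox-as-a-library out of the equation and lets you see the real error message on the command line:

// Hypothetical debugging step: write the incoming buffer to disk so sox can
// be run on it by hand.
const fs = require('fs');

function dumpIncomingAudio(arrayBuffer) {
    fs.writeFileSync('/tmp/debug-input.wav', Buffer.from(arrayBuffer));
    // Then, from a shell, see what sox makes of it:
    //   soxi /tmp/debug-input.wav
    //   sox -V /tmp/debug-input.wav -b 16 -r 16000 -c 1 -e signed-integer -t raw /tmp/out.raw
    // If soxi already rejects the header, the data is not what its label says
    // (e.g. webm bytes in a file named .wav).
}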

Did you try the web microphone example? It does transcription through socket.io/nodejs probably in the way you want.

You actually don’t need to use sox at all; it’s better to downsample the audio in the client before sending it through socket.io, because that puts less processing on the server and then sox isn’t needed.

Could you elaborate on your example here? I can’t find it exactly in your code. Here, we get raw audio out of the browser; it’s not just down-sampling, it’s an actual conversion from 32-bit float to 16-bit signed PCM wave :slight_smile:

In the web microphone example, a web worker is used to downsample from 32-bit float (recorded in the browser) to 16-bit PCM:

This is the code for the web worker, which I didn’t actually write; it’s from another speech / wake word project called Porcupine:

https://github.com/mozilla/DeepSpeech-examples/blob/r0.6/web_microphone_websocket/public/downsampling_worker.js
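Conceptually, the worker does two things: it resamples from the browser’s native rate (often 44.1 or 48 kHz) down to 16 kHz, and it converts each Float32 sample in [-1, 1] to a signed 16-bit integer. A minimal sketch of that idea (not the actual worker code, which filters/averages rather than picking nearest samples):

// Illustration only: naive float32 -> int16 downsampling.
function downsampleToInt16(float32Samples, inputRate, outputRate = 16000) {
    const ratio = inputRate / outputRate;
    const outLength = Math.floor(float32Samples.length / ratio);
    const out = new Int16Array(outLength);
    for (let i = 0; i < outLength; i++) {
        // Nearest-sample decimation; the real worker does proper averaging.
        const sample = float32Samples[Math.floor(i * ratio)];
        // Clamp to [-1, 1] and scale to the signed 16-bit range.
        const clamped = Math.max(-1, Math.min(1, sample));
        out[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7FFF;
    }
    return out;
}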

The web worker is used by this code:

https://github.com/mozilla/DeepSpeech-examples/blob/r0.6/web_microphone_websocket/src/App.js

createAudioProcessor(audioContext, audioSource) {
	let processor = audioContext.createScriptProcessor(4096, 1, 1);
	
	const sampleRate = audioSource.context.sampleRate;
	
	let downsampler = new Worker(DOWNSAMPLING_WORKER);
	downsampler.postMessage({command: "init", inputSampleRate: sampleRate});
	downsampler.onmessage = (e) => {
		if (this.socket.connected) {
			this.socket.emit('stream-data', e.data.buffer);
		}
	};
	
	processor.onaudioprocess = (event) => {
		var data = event.inputBuffer.getChannelData(0);
		downsampler.postMessage({command: "process", inputFrame: data});
	};
	
	processor.shutdown = () => {
		processor.disconnect();
		this.onaudioprocess = null;
	};
	
	processor.connect(audioContext.destination);
	
	return processor;
}

In particular, this is the code that receives the 16-bit integer data and sends it to the socket.io server:

downsampler.onmessage = (e) => {
		if (this.socket.connected) {
			this.socket.emit('stream-data', e.data.buffer);
		}
	};

In the server code that data doesn’t need to be processed further; sox isn’t required because it’s already in the right format.
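On the server that amounts to something like the sketch below. The method names (createStream / feedAudioContent / finishStream) follow the 0.6-era Node API, and the event names 'stream-data', 'stream-end' and 'recognize' are just placeholders matching the client code above; the exact calls are in the web_microphone_websocket server.

// Rough sketch only, assuming `model` is the loaded DeepSpeech model.
let sctx = model.createStream();

socket.on('stream-data', (data) => {
    // Already 16 kHz, 16-bit signed, mono PCM: feed it straight in, no sox step.
    model.feedAudioContent(sctx, Buffer.from(data));
});

socket.on('stream-end', () => {
    const text = model.finishStream(sctx);   // final transcription
    socket.emit('recognize', { text });
    sctx = model.createStream();             // ready for the next utterance
});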

Right, yes, that’s another solution, but it requires a bit more work than just leaving the work to sox. Though you say it’s downsampling when you are actually performing a conversion.

Nothing complicated in itself, but when you try to keep examples simple, it’s nice to avoid it.

Maybe @Oleksii_Davydenko should try it to eliminate the risk that sox is broken somehow.

I did write a JavaScript microphone / hotword / downsampling library which does all this conversion/downsampling pretty cleanly. BumbleBee includes that downsampler library along with Porcupine.

https://github.com/jaxcore/bumblebee-hotword

const BumbleBee = require('bumblebee-hotword');

let bumblebee = new BumbleBee();
bumblebee.setWorkersPath('/bumblebee-workers');
bumblebee.addHotword('bumblebee');

bumblebee.on('data', function(data) {
	// DATA TO SEND TO DEEPSPEECH
});

bumblebee.start();
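Assuming the 'data' event delivers 16 kHz, 16-bit PCM chunks (as with the downsampling worker above), forwarding them to the transcription server is just an emit; the event name here is whatever your server listens on:

bumblebee.on('data', function(data) {
    // Hypothetical wiring: ship the downsampled chunk to the socket.io server.
    if (socket.connected) {
        socket.emit('stream-data', data.buffer);
    }
});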

I’m still not sure what causes this, but I struggled with it a lot as well and ended up using ffmpeg to transcode the audio server-side.

(The problem might be caused by the fact that you create a Blob with an audio/wav header when it is in fact audio/webm, since the MediaRecorder API can’t create wav, and sox, if I’m not mistaken, can’t handle audio/webm; not too sure about that, though.)
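A sketch of that server-side transcoding, assuming ffmpeg is on the PATH and the chunks arrive as webm/opus (not the exact code I use, and the flags may need tweaking):

// Pipe a webm buffer through ffmpeg and get back raw 16 kHz, 16-bit signed,
// mono PCM that DeepSpeech can consume.
const { spawn } = require('child_process');

function transcodeWebmToPCM(webmBuffer, callback) {
    const ffmpeg = spawn('ffmpeg', [
        '-i', 'pipe:0',          // read the webm container from stdin
        '-f', 's16le',           // output raw signed 16-bit little-endian PCM
        '-acodec', 'pcm_s16le',
        '-ar', '16000',          // 16 kHz sample rate expected by the model
        '-ac', '1',              // mono
        'pipe:1'                 // write the raw PCM to stdout
    ]);

    const chunks = [];
    ffmpeg.stdout.on('data', (chunk) => chunks.push(chunk));
    ffmpeg.on('close', () => callback(Buffer.concat(chunks)));

    ffmpeg.stdin.write(webmBuffer);
    ffmpeg.stdin.end();
}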

However, the solution introduced by Dan:

… is another good idea, since it also reduces network traffic thanks to the smaller sample rate.

I actually use both solutions:

  1. Downsampling in the frontend for finished recordings (record before sending to server)
  2. Downsampling server-side for continuously streaming and transcribing the audio in real time

If you want to play around, I’ve created yet another example app which lets you experiment with both and choose different language models (WIP!):

Gitlab

I don’t want to diverge too much, but that’s a nice example. Though I see you are still on 0.6.0; you should move to 0.6.1, which has nice bugfixes (the model did not change, only the lib + exported tflite model).
