Availability of pre-trained models

I’ve seen some articles about training an acoustic model yourself, or about using checkpoints to continue training a pre-trained model.

To my understanding, the 0.5.1 pre-trained model is trained on LibriSpeech, which contains clear, noise-free American English speech. For production use with noise and foreign accents, the 0.5.1 pre-trained model is by far not accurate enough. Even with a custom LM that contains just 20 words, it often fails by treating noise as speech input or by mapping the speech to a similar-sounding word (e.g. beer --> tea).
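For context, I load the model and the custom LM with the Node bindings roughly like this (the paths are placeholders and the constants are the 0.5.x client defaults, as far as I know):

const Ds = require('deepspeech');

// Hyperparameters taken from the stock 0.5.x client (assumed defaults)
const BEAM_WIDTH = 500;
const N_FEATURES = 26;
const N_CONTEXT = 9;
const LM_ALPHA = 0.75;
const LM_BETA = 1.85;

// Placeholder paths for the pre-trained model and my ~20-word custom LM
const model = new Ds.Model('models/output_graph.pbmm', N_FEATURES, N_CONTEXT,
                           'models/alphabet.txt', BEAM_WIDTH);
model.enableDecoderWithLM('models/alphabet.txt', 'custom-lm/lm.binary',
                          'custom-lm/trie', LM_ALPHA, LM_BETA);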

Are there any pre-trained models based on the Common Voice dataset? Or, generally, any English model that’s more noise/accent resistant?

Training it myself would probably take weeks on my modest hardware, and I assume I’m not the first one in need of a more robust English acoustic model, so I figured I’d open this up for discussion.

Thanks!

That’s right.

That feels weird; my experiments with my poor French accent and some background noise show very good results in such a case. Can you elaborate on what exactly you are doing?

Training with Common Voice is not enough, unfortunately, for the moment. We are working on making the model more robust to noise, but for accents we are more dependent on the data volume from Common Voice.

I’m also interested in this. I just tried DeepSpeech for the first time, and with the 0.5.1 model, a German accent with light British English pronunciation gives e.g. “afremov speech” instead of “freedom of speech”.

Suppose there are no accent-resistant models, what’s the way to go? Should I train a model based entirely on the person’s accent/manner of speech?

Help spread the word about Common Voice for a wider variety of accents. Help source other meaningful datasets with accents other than American English.

It does depend on your goal as well.

I stream audio in real time from the web browser in chunks via websocket to a Node.js server. On the server I create a stream and push to it whenever a new chunk arrives; everything after that is the same as in the DeepSpeech/VAD example.

This is the example project I created to play around with. Once the stream has started, the server responds continuously with the detected words. The current master branch should work like a charm. Feel free to try it out… pretty cool that it works this easily from the browser. Check out the README for the very simple setup steps. I used the 0.5.1 pre-trained model and a custom LM.

Can you link to the code that does all of that?

I built something similar a few weeks ago, to give a workshop. It was working pretty well, even with my French accent, so I’m surprised you are having such a bad experience.

This would be the client side; check out the initRecordingStart method. The server side you can also find in the repository I linked before, in ./backend/server.js. The important part there is globalWsStream.push(data), which feeds the server-side stream with the received chunks.

I’m sorry, but I cannot really dig into your source code. Could you link to where you fetch audio from the user’s device, its transformations, and how you push that to the API?

Oh yeah, sure:

Get the audio stream on the client and send it over the websocket:

// Open a websocket connection to the Node.js backend
this.ws = new WebSocket("ws://localhost:3000");

// Capture microphone audio and record it in chunks
let stream = await navigator.mediaDevices.getUserMedia({ audio: true });
let rec = new MediaRecorder(stream);

// Emits an ondataavailable event every 1000 ms
rec.start(1000);

rec.ondataavailable = (e) => {
  // Only the data recorded since the last ondataavailable event is contained here
  this.ws.send(e.data);
};

Server side: receive the audio chunks via websocket and push them into a stream.

const { Duplex } = require('stream');
const { spawn } = require('child_process');

const AUDIO_SAMPLE_RATE = 16000;

// Duplex stream that is fed manually via stream.push() from the websocket handler
let stream = new Duplex({
  read(size) {
    // push() is called elsewhere (websocket handler below)
  },
  write() {}
});

// Transcode whatever the browser sends (webm/opus) into raw 16 kHz mono 16-bit PCM
const ffmpeg = spawn('ffmpeg', [
  '-hide_banner',
  '-nostats',
  '-i', '-',
  '-vn',
  '-acodec', 'pcm_s16le',
  '-ac', '1',
  '-ar', String(AUDIO_SAMPLE_RATE), // 16000
  '-f', 's16le',
  'pipe:1'
]);

stream.pipe(ffmpeg.stdin);

// `websocket` is the connection object from the websocket server
websocket.on('message', (data) => {
  stream.push(data);
});

All that happens from there on is equivalent to the DeepSpeech example: ffmpeg.stdout is piped to the VAD and processed by the model.
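For reference, the downstream part is modeled on DeepSpeech’s ffmpeg_vad_streaming example and looks roughly like this (a sketch from memory: `model` is the loaded 0.5.1 DeepSpeech model, and the exact binding/node-vad method names may differ slightly in your versions):

const VAD = require('node-vad');
const vad = new VAD(VAD.Mode.NORMAL);

// `model` is the loaded DeepSpeech model; 150 pre-allocated frames as in the example
let sctx = model.setupStream(150, AUDIO_SAMPLE_RATE);
let fedAudio = false;

ffmpeg.stdout.on('data', (chunk) => {
  vad.processAudio(chunk, AUDIO_SAMPLE_RATE).then((res) => {
    if (res === VAD.Event.VOICE) {
      // Feed the raw 16-bit PCM chunk into the streaming decoder
      model.feedAudioContent(sctx, chunk.slice(0, chunk.length / 2));
      fedAudio = true;
    } else if (res === VAD.Event.SILENCE && fedAudio) {
      // End of utterance: get the transcription, send it back, start a new stream
      const text = model.finishStream(sctx);
      if (text) websocket.send(text);
      sctx = model.setupStream(150, AUDIO_SAMPLE_RATE);
      fedAudio = false;
    }
  });
});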

I’m not so sure about that. Here is what I have:

navigator.mediaDevices.getUserMedia({ audio: { channelCount: 1 }, video: false })
  .then((stream) => {
    console.log("Got stream");
    console.debug(stream.getAudioTracks()[0]);
    console.debug(navigator.mediaDevices.getSupportedConstraints());
    console.debug(stream.getAudioTracks()[0].getConstraints());
    console.debug(stream.getAudioTracks()[0].getSettings());

    console.debug("Create AudioContext");
    var context = new AudioContext();
    source = context.createMediaStreamSource(stream);

    var recLength = 0, recBuffers = [];

    node = context.createScriptProcessor(4096, 1, 1);

    // Listen to the audio data and record it into the buffer
    node.onaudioprocess = function(e) {
      console.debug("node.onaudioprocess");
      recBuffers.push(e.inputBuffer.getChannelData(0));
      recLength += e.inputBuffer.getChannelData(0).length;
      console.debug("recorded", recLength);
      if (recLength >= (4096 * 64)) {
        // Flatten the recorded chunks into a single Float32Array and send it
        var bufferSize = recBuffers[0].length * recBuffers.length;
        var sendBuffers = new Float32Array(bufferSize);
        for (var i = 0; i < recBuffers.length; i++) {
          sendBuffers.set(recBuffers[i], i * recBuffers[i].length);
        }
        console.debug("buffer", sendBuffers);
        deepSpeechSocket.send(sendBuffers);
        recBuffers = [];
        recLength = 0;
      }
    };

    source.connect(node);
  })
  .catch((err) => {
    console.log("The following error occurred: " + err);
  });

Basically, the raw output from the browser is float32 at 44.1 kHz. I don’t see where you specify this in your case. Then I had this code to convert it, using sox rather than ffmpeg:

bufferToStream(rawAudio)
  .pipe(Sox({
    input: {
      bits: 32,
      rate: 44100,
      channels: 1,
      encoding: 'floating-point',
      endian: 'little',
      type: 'raw',
    },
    output: {
      bits: 16,
      rate: 16000,
      channels: 1,
      encoding: 'signed-integer',
      endian: 'little',
      compression: 0.0,
      type: 'wavpcm',
    }
  }))
  .pipe(audioStream);

I see, but that’s just another way of receiving the chunks, isn’t it? I mean, my code works, and if I speak clearly with no noise I get the right result… but when it’s a little noisy and a word sounds similar, it is not accurate enough. How do you reckon this could be the reason for it?

Well, until I see exactly what you are doing, it’s hard to tell…
In fact, I’m still unsure about what you get out of MediaRecorder. The docs mention you should give it parameters to specify what you want, but you give none: https://developer.mozilla.org/en-US/docs/Web/API/MediaStream_Recording_API

So I have no idea what you give to ffmpeg.
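For what it’s worth, MediaRecorder accepts an options object, so you could at least pin down the output format explicitly (a sketch; the bitrate is just an example value):

// Pin the recording format instead of relying on the browser default
const mimeType = 'audio/webm;codecs=opus';
const options = MediaRecorder.isTypeSupported(mimeType)
  ? { mimeType, audioBitsPerSecond: 128000 } // bitrate is just an example value
  : {};                                      // fall back to the browser default
const rec = new MediaRecorder(stream, options);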

Define “a little noisy”. As I said at the beginning, we know the currently available models are not robust to noise. Also, maybe your accent gets in the way?

ffmpeg receives the default audio/webm stream (44.1 kHz audio) that the MediaRecorder produces. Basically you just tell ffmpeg what it should encode the stream into, without having to care about the input. And since these parameters were used in the DeepSpeech/VAD example, that’s what I went for.

I pass the audio stream from getUserMedia into the MediaRecorder; that should be the way to go…

When people in the same room talk while I speak directly into the microphone, it also tries to transcribe their speech, which results in unwanted responses. It’s clear to me that it works way better for clear, noise-free audio. I guess if you ran the project you would see for yourself what I mean when I say it’s inaccurate. I had good English speakers try it, but it would still miss a word sometimes.

But I understand that you can’t dig into other people’s source code. Just in case you’re interested in my implementation, you could try it out. Maybe it’s just the model and my accent after all.

There isn’t a way to “decrease” the sensitivity, is there? Like not reacting until a certain loudness is reached.

ondataavailable gives me an audio/webm;codecs=opus blob every 1000 ms from the MediaRecorder.

Well, that’s expected with the current model, sadly.

I can’t dig into and learn everybody’s app to diagnose issues. Well, I can, but then I’m not able to work on the other things that need to be worked on.

This would need to be done before passing the audio to DeepSpeech, I fear. Tuning the VAD might help?
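For instance, you could raise the VAD aggressiveness (node-vad has VAD.Mode.AGGRESSIVE / VERY_AGGRESSIVE) or put a crude loudness gate in front of it. A sketch, not tested; the threshold is a guess you would have to tune:

// Crude energy gate on the 16-bit little-endian PCM frames coming out of ffmpeg
function isLoudEnough(frame, threshold = 1000) {
  let sumSquares = 0;
  const samples = frame.length / 2; // 2 bytes per 16-bit sample
  for (let i = 0; i < samples; i++) {
    const s = frame.readInt16LE(i * 2);
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples) >= threshold;
}

ffmpeg.stdout.on('data', (frame) => {
  if (isLoudEnough(frame)) {
    // Only now hand the frame to the VAD / DeepSpeech pipeline
  }
});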

Have you tried enabling some noise cancellation, for example in the getUserMedia request?

Ok, so likely you have the proper headers being passed into the stream for ffmpeg to convert.

I’d think so.

Absolutely! You are already being a great help out here, thanks!

I don’t think I understand what you mean

I think you can pass MediaTrackConstraints, and this covers noise cancellation.
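Something along these lines, using standard MediaTrackConstraints (browser support varies, so these are hints rather than guarantees):

// Ask the browser for its built-in noise suppression / echo cancellation
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    channelCount: 1,
    noiseSuppression: true,
    echoCancellation: true,
    autoGainControl: true
  },
  video: false
});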
