Speed up feedAudioContent for NodeJS lib

Hey guys! First of all, thanks to the whole community, which is amazing! I have found many tutorials and tips.

I have been developing a PTT (push-to-talk) app which recognizes voice commands while a person is holding a button. I used the NodeJS example as the starting point for my app. At this point I have a good recognizer, but the processing time of this method is too long:

  englishModel.feedAudioContent(modelStream, chunk.slice(0, chunk.length / 2));
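For context, the surrounding setup is roughly this (a sketch against the 0.6 NodeJS bindings; the paths, beam width and LM weights below are placeholders, not my exact values):

const Ds = require('deepspeech');

// load the acoustic model and my own LM/trie (placeholder paths)
const englishModel = new Ds.Model('output_graph.pbmm', 500);
englishModel.enableDecoderWithLM('lm.binary', 'trie', 0.75, 1.85);

// one stream per utterance
const modelStream = englishModel.createStream();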

I generate my own LM and trie from the short commands. I'm using the CPU, and for chunks of 8 KBytes (not 8 MBytes) it takes 3.5 seconds. My questions are these:

How many seconds/milliseconds can I gain by using a GPU (GTX 2070)?
Can I speed up the time with any command or argument? I did not see anything in the documentation.
Is the new version 0.7 faster than 0.6?
Any tips?

Thanks again!

8 MBytes of audio in 3.5 seconds? How much audio time is that? I feel like that would be much faster than realtime.

Maybe not that much, depending on where the bottleneck is. GPU itself will produce inference much faster, but you still have some CPU-bound operations, plus the memory transfers.

Could you please share more context on:

  • what you are working on,
  • your expectations for processing time,
  • have you tried changing the chunk size? 8 MB is really quite a lot compared to what we usually do, especially for streaming.
  • It is a simple audio command for a video game, such as 'attack tower', 'spawn warrior'…

  • Right now my expectations are to reduce the time to one second or less.

  • I have tried to reduce the chunk size using the mic lib (https://www.npmjs.com/package/mic) with the arecord flag --buffer-size=1048, but I did not succeed. I had to look inside the lib, and I think what I am doing is not correct:

...
else {
    // patched inside the mic lib: pass an extra --buffer-size flag to arecord
    audioProcess = spawn('arecord', ['-c', channels, '-r', rate, '-f',
                          format, '-D', device, '--buffer-size=1048'],
                          audioProcessOptions);
}
...
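An alternative to patching the lib is to spawn arecord directly and read raw PCM from its stdout. A minimal sketch (the device name is a placeholder):

const { spawn } = require('child_process');

// 16 kHz, 16-bit, mono raw PCM: the format DeepSpeech expects
const audioProcess = spawn('arecord', [
  '-c', '1',
  '-r', '16000',
  '-f', 'S16_LE',
  '-t', 'raw',
  '-D', 'default',
  '--buffer-time=1000000' // ~1 s of buffering; the value is in microseconds
]);

audioProcess.stdout.on('data', (chunk) => {
  // feed each chunk to the DeepSpeech stream here
});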

Thanks for answering so fast!

You should likely use a dedicated language model instead of the generic one.
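For a small command set, building one can be as simple as this sketch (using KenLM plus the generate_trie tool shipped with the 0.6 native_client; the file names are placeholders):

$ lmplz --order 3 --text commands.txt --arpa commands.arpa --discount_fallback
$ build_binary commands.arpa lm.binary
$ ./generate_trie alphabet.txt lm.binary trie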

This is unclear. One second of processing for how much audio time? How much audio do you have in 8 MB?

Can’t help with that, but it’s a raw array, you can process it as you like.

Are you sure about that? It's more than 4 minutes of audio. If that's the case, you are obviously not using the Streaming API correctly.

I too found that feedAudioContent is slow in the NodeJS client lib. While trying out this DeepSpeech Node Angular UI app, I found that CPU usage is quite high during inference… Upon profiling, I found that feedAudioContent is taking a lot of time. It was a little high on Mac, but was maxing out the CPU on Linux.

Which version did I try? 0.6.1

How to reproduce this

Use one of the Node examples, say web_microphone_websocket: https://github.com/mozilla/DeepSpeech-examples/tree/r0.6/web_microphone_websocket

The CPU usage does go high while inferencing. I think it's the same feedAudioContent. I found this only in Node; I tried the Python example code and it did not have this issue.
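For reference, this is roughly how the hot spot can be confirmed with Node's built-in V8 profiler (the entry file name here is just an example):

$ node --prof server.js
$ node --prof-process isolate-*.log > profile.txt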

Please be more descriptive about the issue. This is completely not what we experience using the Streaming API under NodeJS (daily, on RPi4).

So far, the only figures in this thread point to misuse of the API.

I was mistaken :sweat_smile: :sweat_smile:

8KBytes. Sorry

How much audio does that make? 8 KBytes would be ~250ms, am I right?
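For reference, at 16 kHz, 16-bit mono audio the math is:

8192 bytes ÷ (16000 samples/s × 2 bytes/sample) = 256 ms ≈ 250 ms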

That would be super slow, not what we have experienced. @duxy1996 Can you share more details / code on what you do? There's basically nothing here.

You are right, but that is not the full audio. The audio is received by a callback, and each time a chunk arrives I process it with that function. When the last sound arrives, I call
let text = englishModel.finishStream(modelStream);
and I get the full text.

I followed the example found in the DeepSpeech examples repo; mine is very similar. Below I have extracted the main methods from the lib where I have the problem.

When the mic is capturing, the .on('data') event receives the audio buffer data. When the stream has finished, .on('finish') is called; at that point I decode the audio. DeepSpeech usually works pretty well.

Usually, when I release the PTT button, the last words have not been fed to the model yet, and I have to wait extra time while the data is processed by englishModel.feedAudioContent.

function startMicrophone() {

  var stream = microphone.getAudioStream();

  stream.on('data', (data) => {
    processAudioStream(data);
  });

  stream.on('finish', () => {
    predictData();
  });

  microphone.start();
}

function processAudioStream(chunk) {
  // chunk is 16-bit PCM at 16 kHz: chunk.length / 2 samples of 1/16000 s each
  recordedAudioLength += (chunk.length / 2) * (1 / 16000) * 1000;
  // Slow method
  englishModel.feedAudioContent(modelStream, chunk.slice(0, chunk.length / 2));
}

function predictData() {
  let text = englishModel.finishStream(modelStream);
  console.log(text);
}

If you require more information, @lissyx, tell me! And thanks for all your help. I have never used this technology before and I sometimes find it tricky.

Please elaborate, we try to make it simple.

I have not verified this example exactly, but as a first step you should make sure you are not running audio capture and DeepSpeech on the same thread.

So you might want to buffer a bit more than 8 kB, since that covers only ~250ms. Try chunks of ~1s, i.e. 32 kB (16000 samples/s × 2 bytes/sample).

But nevertheless, 3.5s to process 250ms of audio is way too slow. What's your hardware?

I will use JavaScript workers. Right now everything is executed in the same thread.
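A rough sketch of that approach with Node's worker_threads (untested; the file names, model path, and message protocol are my own assumptions):

// inference-worker.js: owns the model so feedAudioContent runs off the main thread
const { parentPort } = require('worker_threads');
const Ds = require('deepspeech');

const model = new Ds.Model('output_graph.pbmm', 500); // placeholder path / beam width
let modelStream = model.createStream();

parentPort.on('message', (msg) => {
  if (msg.type === 'chunk') {
    const chunk = Buffer.from(msg.data); // structured clone arrives as a Uint8Array
    model.feedAudioContent(modelStream, chunk.slice(0, chunk.length / 2));
  } else if (msg.type === 'finish') {
    parentPort.postMessage(model.finishStream(modelStream));
    modelStream = model.createStream(); // ready for the next utterance
  }
});

// main thread: audio capture stays responsive
const { Worker } = require('worker_threads');
const worker = new Worker('./inference-worker.js');

stream.on('data', (data) => worker.postMessage({ type: 'chunk', data: data }));
stream.on('finish', () => worker.postMessage({ type: 'finish' }));
worker.on('message', (text) => console.log(text));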

I will try that solution. My setup is:

Processor: AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G × 4
Graphics: NV137 (1050Ti 4GiB)
Memory: 15.6 GiB
Base system: Ubuntu 16.04.6 LTS 64-bit

Increasing the chunk size to 1 second, the results are globally better.

In the mic lib I added the flag '--buffer-time=1000000' (the value is in microseconds, so 1000000 µs = 1 s).

320ms is the optimal chunk size for latency as it matches how often the model can process input. Chunk sizes smaller than that make no difference, it’ll still be processed only after 320ms of audio are accumulated. Chunk sizes larger than that will increase latency.
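For reference, at 16 kHz, 16-bit mono audio a 320 ms chunk works out to:

0.320 s × 16000 samples/s × 2 bytes/sample = 10,240 bytes ≈ 10 kB per chunk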


Could you please be more descriptive? What's "globally better"?

What's your NodeJS version as well?

That seems quite low end, but it should not be that slow.

Could you please reproduce the speed measurement using:

  • the NodeJS bindings,
  • the Python bindings,
  • the C++ bindings.

Using this:

$ npm install deepspeech@0.6.1
$ time deepspeech [...]
$ pip3 install deepspeech==0.6.1
$ time deepspeech [...]
$ wget https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/native_client.amd64.cpu.linux.tar.xz && tar xf native_client.amd64.cpu.linux.tar.xz
$ time ./deepspeech [...]

NodeJS v13.11

The time from when I start sending chunks to
englishModel.feedAudioContent(modelStream, chunk.slice(0, chunk.length / 2));
until all the received chunks are processed is lower.