DeepSpeech benchmarking / Shorten inference time

Do you plan to implement an optimised / reduced model for faster inference, similar to Google’s approach?

With the current pre-trained model, inference on my MacBook Pro 13" (early 2015, i5, 8 GB RAM) takes 25 sec for 9 sec of audio (mono, 16 kHz). I am building a POC for live STT, and such a delay is not acceptable for that use case (of course I could add a GPU, but I wonder how sustainable that is in the long run).
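
One way to put a number on "too slow for live STT" is the real-time factor: processing time divided by audio duration, where anything above 1 means the recogniser falls behind the audio. A minimal sketch with the figures quoted above (the function names are made up for illustration):

```cpp
#include <stdexcept>

// Real-time factor: seconds of processing per second of audio.
// A value <= 1.0 means inference keeps up with a live stream.
double real_time_factor(double processing_seconds, double audio_seconds) {
  if (audio_seconds <= 0.0) {
    throw std::invalid_argument("audio duration must be positive");
  }
  return processing_seconds / audio_seconds;
}

// True when the measured run could keep up with live audio.
bool keeps_up_with_live_audio(double processing_seconds, double audio_seconds) {
  return real_time_factor(processing_seconds, audio_seconds) <= 1.0;
}
```

For the run above, `real_time_factor(25.0, 9.0)` is roughly 2.8, so the model would need to be almost three times faster to keep up.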

Maybe there is also an optimisation issue on my machine. I would suggest that everyone posts their inference times on various machines so that we can set some targets.

Yes, we are working on that :). 25 secs for 9 sec of audio seems a bit high, but might not be unexpected depending on your exact CPU specs. There is also the language model that is applied after audio decoding that consumes some CPU.

The timing above is the duration of mPriv->session->Run (decoding comes next), but that is indeed dependent on the processor. I would be interested to know how long it should take on decent, up-to-date consumer hardware, like a 2017 MacBook Pro 13 / 17 (i5 / i7 processor).

You might want to look into using the C++ binary: it has a -t argument that prints some runtime info. But I think we might want to change that to give more control and to separate inference from decoding.

For example, it would be interesting to compare a -t call with and without the language model (the lm and trie parameters are optional).
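
To decompose a run into feature extraction, inference and decoding, each phase can be wrapped in a small timer. This is only a sketch: `PhaseTimer` and the phase names are hypothetical and not part of the DeepSpeech client.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulates wall-clock seconds per named phase.
class PhaseTimer {
 public:
  // Runs fn() and adds its duration to the given phase.
  template <typename Fn>
  void measure(const std::string& phase, Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    seconds_[phase] += elapsed.count();
  }

  // Total seconds recorded for a phase, or 0.0 if it was never measured.
  double seconds(const std::string& phase) const {
    const auto it = seconds_.find(phase);
    return it == seconds_.end() ? 0.0 : it->second;
  }

 private:
  std::map<std::string, double> seconds_;
};
```

A call such as `timer.measure("decode", [&] { /* run the decoder */ });` around each step would then give one total per phase, with and without the language model.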


I also tried to make the inference faster. Of course running it on GPU helped a lot.
CPU: 17 sec
GPU: 3 sec
for ~1 sec audio file

But 3 seconds is still a lot of time. What I found is that if I run multiple inferences one after another, only the first one takes that long; the following ones need only about half a second. So that would work for continuous speech recognition.

@johannes.selymes How do you measure the time? It looks like it includes loading the files, which are big :slight_smile:

No I just use the -t option of the
I changed the source file to load another file after the first one, and then run LocalDsSTT(…) again on the new buffer.
I did not measure the time spent on loading and the sox processing…

./native_client/deepspeech modelsPretrained/output_graph.pb ~/recordings/yes01.wav data/alphabet.txt modelsPretrained/lm.binary modelsPretrained/trie -t
2018-02-13 09:48:36.610077: I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-13 09:48:36.610543: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: 
name: Quadro M1000M major: 5 minor: 0 memoryClockRate(GHz): 1.0715
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 2.66GiB
2018-02-13 09:48:36.610559: I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0)
cpu_time_overall=6.43293 cpu_time_mfcc=0.00152 cpu_time_infer=6.43142
cpu_time_overall=0.43462 cpu_time_mfcc=0.00166 cpu_time_infer=0.43296
cpu_time_overall=0.25265 cpu_time_mfcc=0.00113 cpu_time_infer=0.25152
cpu_time_overall=0.26726 cpu_time_mfcc=0.00106 cpu_time_infer=0.26620

It’s interesting, can you share your changes ?

Yes, but it is kind of a hack :wink:

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <math.h>
#include <string.h>
#include <sox.h>
#include <time.h>
#include "deepspeech.h"
#ifdef __APPLE__
#include <unistd.h>
#endif

#define N_CEP 26
#define N_CONTEXT 9
#define BEAM_WIDTH 500
#define LM_WEIGHT 1.75f
#define WORD_COUNT_WEIGHT 1.00f

using namespace DeepSpeech;

struct ds_result {
  char* string;
  double cpu_time_overall;
  double cpu_time_mfcc;
  double cpu_time_infer;
};

// DsSTT() instrumented
struct ds_result*
LocalDsSTT(Model& aCtx, const short* aBuffer, size_t aBufferSize,
           int aSampleRate)
{
  float* mfcc;
  struct ds_result* res = (struct ds_result*)malloc(sizeof(struct ds_result));
  if (!res) {
    return NULL;
  }

  clock_t ds_start_time = clock();
  clock_t ds_end_mfcc = 0, ds_end_infer = 0;
  int n_frames = 0;
  aCtx.getInputVector(aBuffer, aBufferSize, aSampleRate, &mfcc, &n_frames);
  ds_end_mfcc = clock();

  res->string = aCtx.infer(mfcc, n_frames);
  ds_end_infer = clock();

  // The feature buffer is no longer needed once inference has run.
  free(mfcc);

  res->cpu_time_overall =
    ((double) (ds_end_infer - ds_start_time)) / CLOCKS_PER_SEC;
  res->cpu_time_mfcc =
    ((double) (ds_end_mfcc  - ds_start_time)) / CLOCKS_PER_SEC;
  res->cpu_time_infer =
    ((double) (ds_end_infer - ds_end_mfcc))   / CLOCKS_PER_SEC;

  return res;
}

int soxReadAudioFile(char* filename, char* &buffer, size_t &buffer_size, int &sampleRate) {

  sox_format_t* input = sox_open_read(filename, NULL, NULL, NULL);

  sampleRate = (int)input->signal.rate;

  // Resample/reformat the audio so we can pass it through the MFCC functions
  sox_signalinfo_t target_signal = {
      SOX_UNSPEC, // Rate
      1, // Channels
      16, // Precision
      SOX_UNSPEC, // Length
      NULL // Effects headroom multiplier
  };

  sox_encodinginfo_t target_encoding = {
    SOX_ENCODING_SIGN2, // Sample format
    16, // Bits per sample
    0.0, // Compression factor
    sox_option_default, // Should bytes be reversed
    sox_option_default, // Should nibbles be reversed
    sox_option_default, // Should bits be reversed (pairs of bits?)
    sox_false // Reverse endianness
  };

#ifdef __APPLE__
  // It would be preferable to use sox_open_memstream_write here, but OS-X
  // doesn't support POSIX 2008, which it requires. See Issue #461.
  // Instead, we write to a temporary file.
  char* output_name = tmpnam(NULL);
  sox_format_t* output = sox_open_write(output_name, &target_signal,
                                        &target_encoding, "raw", NULL, NULL);
#else
  sox_format_t* output = sox_open_memstream_write(&buffer, &buffer_size,
                                                  &target_signal,
                                                  &target_encoding,
                                                  "raw", NULL);
#endif

  // Setup the effects chain to decode/resample
  char* sox_args[10];
  sox_effects_chain_t* chain =
    sox_create_effects_chain(&input->encoding, &output->encoding);

  sox_effect_t* e = sox_create_effect(sox_find_effect("input"));
  sox_args[0] = (char*)input;
  assert(sox_effect_options(e, 1, sox_args) == SOX_SUCCESS);
  assert(sox_add_effect(chain, e, &input->signal, &input->signal) ==
         SOX_SUCCESS);
  free(e);

  e = sox_create_effect(sox_find_effect("channels"));
  assert(sox_effect_options(e, 0, NULL) == SOX_SUCCESS);
  assert(sox_add_effect(chain, e, &input->signal, &output->signal) ==
         SOX_SUCCESS);
  free(e);

  e = sox_create_effect(sox_find_effect("output"));
  sox_args[0] = (char*)output;
  assert(sox_effect_options(e, 1, sox_args) == SOX_SUCCESS);
  assert(sox_add_effect(chain, e, &input->signal, &output->signal) ==
         SOX_SUCCESS);
  free(e);

  // Finally run the effects chain
  sox_flow_effects(chain, NULL, NULL);
  sox_delete_effects_chain(chain);

#ifdef __APPLE__
  // Remember the output size (16-bit samples, 2 bytes each) before the
  // handle is closed.
  const size_t output_bytes = (size_t)(output->olength * 2);
#endif

  // Close sox handles
  sox_close(output);
  sox_close(input);

#ifdef __APPLE__
  // Read the temporary file back into the caller's buffer. Assign to the
  // reference parameters rather than declaring new locals that shadow them.
  buffer_size = output_bytes;
  buffer = (char*)malloc(sizeof(char) * buffer_size);
  FILE* output_file = fopen(output_name, "rb");
  assert(fread(buffer, sizeof(char), buffer_size, output_file) == buffer_size);
  fclose(output_file);
#endif
  return 0;
}

int
main(int argc, char **argv)
{
  if (argc < 4 || argc > 7) {
    printf("Usage: deepspeech MODEL_PATH AUDIO_PATH ALPHABET_PATH [LM_PATH] [TRIE_PATH] [-t]\n");
    printf("  MODEL_PATH\tPath to the model (protocol buffer binary file)\n");
    printf("  AUDIO_PATH\tPath to the audio file to run"
           " (any file format supported by libsox)\n");
    printf("  ALPHABET_PATH\tPath to the configuration file specifying"
           " the alphabet used by the network.\n");
    printf("  LM_PATH\tOptional: Path to the language model binary file.\n");
    printf("  TRIE_PATH\tOptional: Path to the language model trie file created with"
           " native_client/generate_trie.\n");
    printf("  -t\t\tRun in benchmark mode, output mfcc & inference time\n");
    return 1;
  }

  // Initialise DeepSpeech
  Model ctx = Model(argv[1], N_CEP, N_CONTEXT, argv[3], BEAM_WIDTH);

  if (argc > 5) {
    ctx.enableDecoderWithLM(argv[3], argv[4], argv[5], LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT);
  }

  // Initialise SOX
  assert(sox_init() == SOX_SUCCESS);

  char* buffer = NULL;
  size_t buffer_size = 0;
  int sampleRate = 0;
  soxReadAudioFile(argv[2], buffer, buffer_size, sampleRate);

  // Pass audio to DeepSpeech
  // We take half of buffer_size because buffer is a char* while
  // LocalDsSTT() expected a short*
  struct ds_result* result = LocalDsSTT(ctx, (const short*)buffer,
                                        buffer_size / 2, sampleRate);

  if (result) {
    if (result->string) {
      printf("%s\n", result->string);
      free(result->string);
    }

    if (!strncmp(argv[argc-1], "-t", 3)) {
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f "
             "cpu_time_infer=%.05f\n",
             result->cpu_time_overall,
             result->cpu_time_mfcc,
             result->cpu_time_infer);
    }

    free(result);
  }

  // read 2nd file for testing
  soxReadAudioFile("/home/johsely/recordings/accept01.wav", buffer, buffer_size, sampleRate);

  result = LocalDsSTT(ctx, (const short*)buffer,
                      buffer_size / 2, sampleRate);

  if (result) {
    if (result->string) {
      printf("%s\n", result->string);
      free(result->string);
    }

    if (!strncmp(argv[argc-1], "-t", 3)) {
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f "
             "cpu_time_infer=%.05f\n",
             result->cpu_time_overall,
             result->cpu_time_mfcc,
             result->cpu_time_infer);
    }

    free(result);
  }

  // read 3rd file for testing
  soxReadAudioFile("/home/johsely/recordings/test01.wav", buffer, buffer_size, sampleRate);

  result = LocalDsSTT(ctx, (const short*)buffer,
                      buffer_size / 2, sampleRate);

  if (result) {
    if (result->string) {
      printf("%s\n", result->string);
      free(result->string);
    }

    if (!strncmp(argv[argc-1], "-t", 3)) {
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f "
             "cpu_time_infer=%.05f\n",
             result->cpu_time_overall,
             result->cpu_time_mfcc,
             result->cpu_time_infer);
    }

    free(result);
  }

  // read 4th file for testing
  soxReadAudioFile("/home/johsely/recordings/check01.wav", buffer, buffer_size, sampleRate);

  result = LocalDsSTT(ctx, (const short*)buffer,
                      buffer_size / 2, sampleRate);

  if (result) {
    if (result->string) {
      printf("%s\n", result->string);
      free(result->string);
    }

    if (!strncmp(argv[argc-1], "-t", 3)) {
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f "
             "cpu_time_infer=%.05f\n",
             result->cpu_time_overall,
             result->cpu_time_mfcc,
             result->cpu_time_infer);
    }

    free(result);
  }

  // Deinitialise and quit
  sox_quit();

  return 0;
}

Right, I did that in a somewhat cleaner way; here are my results on a GTX 1080:

./deepspeech ~/tmp/deepspeech/models/output_graph.pb ~/tmp/deepspeech/audio/ ~/tmp/deepspeech/models/alphabet.txt -t 2>&1
2018-02-13 13:56:09.560697: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-13 13:56:09.680074: I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-13 13:56:09.680295: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.797
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 7.55GiB
2018-02-13 13:56:09.680308: I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Running on directory /home/alexandre/tmp/deepspeech/audio/
> /home/alexandre/tmp/deepspeech/audio//2830-3980-0043.wav
experience proves tis
cpu_time_overall=1.89433 cpu_time_mfcc=0.00241 cpu_time_infer=1.89192
> /home/alexandre/tmp/deepspeech/audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=0.47934 cpu_time_mfcc=0.00316 cpu_time_infer=0.47618
> /home/alexandre/tmp/deepspeech/audio//8455-210777-0068.wav
your powr is sufficient i said
cpu_time_overall=0.47399 cpu_time_mfcc=0.00317 cpu_time_infer=0.47081

It would be great if you could reproduce my steps from my branch (using the v0.1.1 model and audio files, so that we compare like with like):

Instead of passing one WAV file as parameter, just pass a directory containing several. So in our case, the audio directory extracted from

I came back with a new set of tests on new hardware (Intel® Core™ i7-6600U CPU @ 2.60GHz × 4, 16 GB RAM, Ubuntu 16.04 LTS), with TensorFlow 1.4.1 and the DeepSpeech pre-trained model v0.1.1.

8.5 sec of audio -> inference time 12.8 sec, using the code below (look for VIRGIL12FEB2018, and see especially the session's inter_op_parallelism_threads setting). Without those changes inference takes 19 sec.

I hope my reply is not too chaotic :slight_smile: - I didn’t know how to attach code.


#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include "native_client/deepspeech_model_core.h" // generated


#include "deepspeech.h"
#include "deepspeech_utils.h"
#include "alphabet.h"
#include "beam_search.h"

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"

#define BATCH_SIZE 1

using namespace tensorflow;
using tensorflow::ctc::CTCBeamSearchDecoder;
using tensorflow::ctc::CTCDecoder;

namespace DeepSpeech {

class Private {
  public:
    Session* session;
    GraphDef graph_def;
    int ncep;
    int ncontext;
    Alphabet* alphabet;
    KenLMBeamScorer* scorer;
    int beam_width;
    bool run_aot;
};

Model::Model(const char* aModelPath, int aNCep, int aNContext,
             const char* aAlphabetConfigPath, int aBeamWidth)
{
  mPriv = new Private;
  mPriv->session = NULL;
  mPriv->scorer = NULL;
  mPriv->ncep = aNCep;
  mPriv->ncontext = aNContext;
  mPriv->alphabet = new Alphabet(aAlphabetConfigPath);
  mPriv->beam_width = aBeamWidth;
  mPriv->run_aot = false;

  if (!aModelPath || strlen(aModelPath) < 1) {
    std::cerr << "No model specified, will rely on built-in model." << std::endl;
    mPriv->run_aot = true;
    return;
  }

  tensorflow::SessionOptions options;
  tensorflow::ConfigProto & config = options.config;
  Status status = NewSession(options, &mPriv->session);

  //Status status = NewSession(SessionOptions(), &mPriv->session);
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;
    return;
  }

  status = ReadBinaryProto(Env::Default(), aModelPath, &mPriv->graph_def);
  if (!status.ok()) {
    mPriv->session = NULL;
    std::cerr << status.ToString() << std::endl;
    return;
  }

  status = mPriv->session->Create(mPriv->graph_def);
  if (!status.ok()) {
    mPriv->session = NULL;
    std::cerr << status.ToString() << std::endl;
    return;
  }

  // Check that the alphabet matches the number of classes in the graph.
  for (int i = 0; i < mPriv->graph_def.node_size(); ++i) {
    NodeDef node = mPriv->graph_def.node(i);
    if (node.name() == "logits/shape/2") {
      int final_dim_size = node.attr().at("value").tensor().int_val(0) - 1;
      if (final_dim_size != mPriv->alphabet->GetSize()) {
        std::cerr << "Error: Alphabet size does not match loaded model: alphabet "
                  << "has size " << mPriv->alphabet->GetSize()
                  << ", but model has " << final_dim_size
                  << " classes in its output. Make sure you're passing an alphabet "
                  << "file with the same size as the one used for training."
                  << std::endl;
        mPriv->session = NULL;
        return;
      }
    }
  }
}

Model::~Model()
{
  if (mPriv->session) {
    mPriv->session->Close();
  }

  delete mPriv->alphabet;
  delete mPriv->scorer;

  delete mPriv;
}

void
Model::enableDecoderWithLM(const char* aAlphabetConfigPath, const char* aLMPath,
                           const char* aTriePath, float aLMWeight,
                           float aWordCountWeight, float aValidWordCountWeight)
{
  mPriv->scorer = new KenLMBeamScorer(aLMPath, aTriePath, aAlphabetConfigPath,
                                      aLMWeight, aWordCountWeight, aValidWordCountWeight);
}

void
Model::getInputVector(const short* aBuffer, unsigned int aBufferSize,
                      int aSampleRate, float** aMfcc, int* aNFrames,
                      int* aFrameLen)
{
  return audioToInputVector(aBuffer, aBufferSize, aSampleRate, mPriv->ncep,
                            mPriv->ncontext, aMfcc, aNFrames, aFrameLen);
}

char*
Model::decode(int aNFrames, float*** aLogits)
{
  const int batch_size = BATCH_SIZE;
  const int top_paths = 1;
  const int timesteps = aNFrames;
  const size_t num_classes = mPriv->alphabet->GetSize() + 1; // +1 for blank

  // Raw data containers (arrays of floats, ints, etc.).
  int sequence_lengths[batch_size] = {timesteps};

  // Convert data containers to the format accepted by the decoder, simply
  // mapping the memory from the container to an Eigen::ArrayXi,::MatrixXf,
  // using Eigen::Map.
  Eigen::Map<const Eigen::ArrayXi> seq_len(&sequence_lengths[0], batch_size);
  std::vector<Eigen::Map<const Eigen::MatrixXf>> inputs;
  for (int t = 0; t < timesteps; ++t) {
    inputs.emplace_back(&aLogits[t][0][0], batch_size, num_classes);
  }

  // Prepare containers for output and scores.
  // CTCDecoder::Output is std::vector<std::vector<int>>
  std::vector<CTCDecoder::Output> decoder_outputs(top_paths);
  for (CTCDecoder::Output& output : decoder_outputs) {
    output.resize(batch_size);
  }
  float score[batch_size][top_paths] = {{0.0}};
  Eigen::Map<Eigen::MatrixXf> scores(&score[0][0], batch_size, top_paths);

  if (mPriv->scorer == NULL) {
    CTCBeamSearchDecoder<>::DefaultBeamScorer scorer;
    CTCBeamSearchDecoder<> decoder(num_classes,
                                   mPriv->beam_width,
                                   &scorer,
                                   batch_size);
    decoder.Decode(seq_len, inputs, &decoder_outputs, &scores).ok();
  } else {
    CTCBeamSearchDecoder<KenLMBeamState> decoder(num_classes,
                                                 mPriv->beam_width,
                                                 mPriv->scorer,
                                                 batch_size);
    decoder.Decode(seq_len, inputs, &decoder_outputs, &scores).ok();
  }

  // Output is an array of shape (1, n_results, result_length).
  // In this case, n_results is also equal to 1.
  size_t output_length = decoder_outputs[0][0].size() + 1;

  size_t decoded_length = 1; // add 1 for the \0
  for (int i = 0; i < output_length - 1; i++) {
    int64 character = decoder_outputs[0][0][i];
    const std::string& str = mPriv->alphabet->StringFromLabel(character);
    decoded_length += str.size();
  }

  char* output = (char*)malloc(sizeof(char) * decoded_length);
  char* pen = output;
  for (int i = 0; i < output_length - 1; i++) {
    int64 character = decoder_outputs[0][0][i];
    const std::string& str = mPriv->alphabet->StringFromLabel(character);
    strncpy(pen, str.c_str(), str.size());
    pen += str.size();
  }
  *pen = '\0';

  // Release the logits matrix allocated by infer().
  for (int i = 0; i < timesteps; ++i) {
    for (int j = 0; j < batch_size; ++j) {
      free(aLogits[i][j]);
    }
    free(aLogits[i]);
  }
  free(aLogits);

  return output;
}

char*
Model::infer(float* aMfcc, int aNFrames, int aFrameLen)
{
const int batch_size = BATCH_SIZE;
const int timesteps = aNFrames;
const size_t num_classes = mPriv->alphabet->GetSize() + 1; // +1 for blank

const int frameSize = mPriv->ncep + (2 * mPriv->ncep * mPriv->ncontext);

// Logits matrix of shape [timesteps][batch_size][num_classes].
float*** input_data_mat = (float***)calloc(timesteps, sizeof(float**));
for (int i = 0; i < timesteps; ++i) {
  input_data_mat[i] = (float**)calloc(batch_size, sizeof(float*));
  for (int j = 0; j < batch_size; ++j) {
    input_data_mat[i][j] = (float*)calloc(num_classes, sizeof(float));
  }
}

if (mPriv->run_aot) {
Eigen::ThreadPool tp(2); // Size the thread pool as appropriate.
Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

nativeModel nm(nativeModel::AllocMode::RESULTS_AND_TEMPS_ONLY);

for (int ot = 0; ot < timesteps; ot += DS_MODEL_TIMESTEPS) {
  nm.set_arg0_data(&(aMfcc[ot * frameSize]));

  // The CTCDecoder works with log-probs.
  // Both bounds must hold, so combine them with && (a comma would silently
  // discard the first condition).
  for (int t = 0; t < DS_MODEL_TIMESTEPS && (ot + t) < timesteps; ++t) {
    for (int b = 0; b < batch_size; ++b) {
      for (int c = 0; c < num_classes; ++c) {
        input_data_mat[ot + t][b][c] = nm.result0(t, b, c);
      }
    }
  }
}

std::cerr << "No support for native model built-in." << std::endl;
return NULL;
} else {
if (aFrameLen == 0) {
  aFrameLen = frameSize;
} else if (aFrameLen < frameSize) {
  std::cerr << "mfcc features array is too small (expected " <<
    frameSize << ", got " << aFrameLen << ")\n";
  return NULL;
}

Tensor input(DT_FLOAT, TensorShape({1, aNFrames, frameSize}));

// Copy the MFCC features into the input tensor, skipping any padding
// between frames when aFrameLen is larger than frameSize.
auto input_mapped = input.tensor<float, 3>();
for (int i = 0, idx = 0; i < aNFrames; i++) {
  for (int j = 0; j < frameSize; j++, idx++) {
    input_mapped(0, i, j) = aMfcc[idx];
  }
  idx += (aFrameLen - frameSize);
}

Tensor n_frames(DT_INT32, TensorShape({1}));
n_frames.scalar<int>()() = aNFrames;

clock_t begin = clock();
std::cout << "------>Start mPriv->session->Run\n";

// The CTC Beam Search decoder takes logits as input, we can feed those from
// the "logits" node in official models or
// the "logits_output_node" in old AOT hacking models
std::vector<Tensor> outputs;
Status status = mPriv->session->Run(
  {{ "input_node", input }, { "input_lengths", n_frames }},
  {"logits"}, {}, &outputs);

std::cout << "------>Done mPriv->session->Run (decoding goes next): " << double(clock() - begin)  / CLOCKS_PER_SEC << "\n";
begin = clock();

// If "logits" doesn't exist, this is an older graph. Try to recover.
if (status.code() == tensorflow::error::NOT_FOUND) {
  status = mPriv->session->Run(
    {{ "input_node", input }, { "input_lengths", n_frames }},
    {"logits_output_node"}, {}, &outputs);
}

if (!status.ok()) {
  std::cerr << "Error running session: " << status.ToString() << "\n";
  return NULL;
}

auto logits_mapped = outputs[0].tensor<float, 3>();
// The CTCDecoder works with log-probs.
for (int t = 0; t < timesteps; ++t) {
  for (int b = 0; b < batch_size; ++b) {
    for (int c = 0; c < num_classes; ++c) {
      input_data_mat[t][b][c] = logits_mapped(t, b, c);
    }
  }
}
} // end of the TensorFlow session branch

// Time the CTC decoder call itself.
clock_t decode_begin = clock();
char* decoded = decode(aNFrames, input_data_mat);
std::cout << "------>Done decoding: " << double(clock() - decode_begin)  / CLOCKS_PER_SEC << "\n";

return decoded;
}

char*
Model::stt(const short* aBuffer, unsigned int aBufferSize, int aSampleRate)
{
  float* mfcc;
  char* string;
  int n_frames;

  getInputVector(aBuffer, aBufferSize, aSampleRate, &mfcc, &n_frames, NULL);
  string = infer(mfcc, n_frames);
  free(mfcc);
  return string;
}

} // namespace DeepSpeech


Your result now seems more on par with what we experience. I’m surprised you had to play with session options to trigger threads; I do see threads working here.

@lissyx Maybe I can, if I have time.
But I think we both showed that the second and following inferences run much faster than the first one. At least fast enough for me now :slight_smile:

Yeah, thanks for noticing, I would not have bet on such a difference.

I can confirm Johannes’ observation: the first inference is slower, and the following ones are faster.
I did a brief test of live STT on the machine with the above specs (using a custom auditok) and it works quite satisfactorily (speed, accuracy vs. noise). I would say it is decently close to Google offline :-)

I’ve had a closer look, and as I expected it’s mostly coming from the loading of the protocol buffer file. Switching to TensorFlow’s MemmappedEnv loading helps.