DeepSpeech benchmarking / Shorten inference time


(Nastasev) #1

Hi guys,

You are doing a terrific job here - congrats!!

Do you plan to implement an optimised / reduced model for faster inference, similar to Google’s approach (https://arxiv.org/pdf/1603.03185.pdf)?

With the current pre-trained model, inference on my MacBook Pro 13" early 2015 (i5, 8GB RAM) takes 25 sec for a 9 sec audio clip (mono, 16 kHz). I’m doing a POC for live STT and such a delay is not acceptable for that use case (of course I could add a GPU, but I wonder how sustainable that is in the long run).

Maybe there is also an optimisation issue on my machine - I would propose posting your inference times on various machines in order to set some targets.

Thanks a lot and keep up the great job you’re doing!
Virgil


(Lissyx) #2

Yes, we are working on that :). 25 secs for 9 secs of audio seems a bit high, but might not be unexpected depending on your exact CPU specs. There is also the language model, which is applied after the acoustic decoding and consumes some CPU.


(Nastasev) #3

Happy to hear you’re working on an optimised model - if you need testing hands, let me know :slight_smile:

The timing above is the duration of mPriv->session->Run (decoding comes next), but that’s indeed dependent on the processor. I’d be interested to know how long it should take on decent, up-to-date consumer hardware, like a 2017 MacBook Pro 13 / 17 (i5 / i7 processors).

Thanks a lot !
Virgil


(Lissyx) #4

What you might want to look into is using the C++ binary; it has a -t argument that provides some runtime info. But I think we might want to change that to have more control and decompose inference / decoding?

I.e., it’d be interesting if you compared a -t call with and without the language model (the lm and trie parameters are optional).
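
For example (the paths here are just placeholders, adjust to your setup), something along these lines:

./deepspeech output_graph.pb audio.wav alphabet.txt lm.binary trie -t
./deepspeech output_graph.pb audio.wav alphabet.txt -t

The first call uses the LM and trie, the second one leaves them out, so the difference in the reported times should give a rough idea of how much the language model costs.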


(JohSely) #5

Hi,

I also tried to make the inference faster. Of course, running it on the GPU helped a lot.
CPU: 17 sec
GPU: 3 sec
for a ~1 sec audio file

But 3 seconds is still a lot of time. What I found out is that if I run multiple inferences one after another, only the first one needs that much time; the following ones only needed about half a second. So that would work for continuous speech recognition.


(Lissyx) #6

@johannes.selymes How do you measure the time? It seems like it covers loading of the files, which are big :slight_smile:


(JohSely) #7

No, I just use the -t option of client.cc.
I changed the source file to just load another file after the first one, and then run LocalDsSTT(…) again on the new buffer.
I did not measure the time of the loading and the sox stuff…

./native_client/deepspeech modelsPretrained/output_graph.pb ~/recordings/yes01.wav data/alphabet.txt modelsPretrained/lm.binary modelsPretrained/trie -t
2018-02-13 09:48:36.610077: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-13 09:48:36.610543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Quadro M1000M major: 5 minor: 0 memoryClockRate(GHz): 1.0715
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 2.66GiB
2018-02-13 09:48:36.610559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro M1000M, pci bus id: 0000:01:00.0, compute capability: 5.0)
yes
cpu_time_overall=6.43293 cpu_time_mfcc=0.00152 cpu_time_infer=6.43142
except 
cpu_time_overall=0.43462 cpu_time_mfcc=0.00166 cpu_time_infer=0.43296
yes
cpu_time_overall=0.25265 cpu_time_mfcc=0.00113 cpu_time_infer=0.25152
ok
cpu_time_overall=0.26726 cpu_time_mfcc=0.00106 cpu_time_infer=0.26620

(Lissyx) #8

It’s interesting, can you share your changes?


(JohSely) #9

Yes, but it’s some kind of a hack :wink:

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <math.h>
#include <string.h>
#include <sox.h>
#include <time.h>
#include "deepspeech.h"
#ifdef __APPLE__
#include <unistd.h>
#endif

#define N_CEP 26
#define N_CONTEXT 9
#define BEAM_WIDTH 500
#define LM_WEIGHT 1.75f
#define WORD_COUNT_WEIGHT 1.00f
#define VALID_WORD_COUNT_WEIGHT 1.00f

using namespace DeepSpeech;

struct ds_result {
  char* string;
  double cpu_time_overall;
  double cpu_time_mfcc;
  double cpu_time_infer;
};

// DsSTT() instrumented
struct ds_result*
LocalDsSTT(Model& aCtx, const short* aBuffer, size_t aBufferSize,
           int aSampleRate)
{
  float* mfcc;
  struct ds_result* res = (struct ds_result*)malloc(sizeof(struct ds_result));
  if (!res) {
    return NULL;
  }

  clock_t ds_start_time = clock();
  clock_t ds_end_mfcc = 0, ds_end_infer = 0;
  
  int n_frames = 0;
  aCtx.getInputVector(aBuffer, aBufferSize, aSampleRate, &mfcc, &n_frames);
  ds_end_mfcc = clock();

  res->string = aCtx.infer(mfcc, n_frames);
  ds_end_infer = clock();

  free(mfcc);

  res->cpu_time_overall =
    ((double) (ds_end_infer - ds_start_time)) / CLOCKS_PER_SEC;
  res->cpu_time_mfcc =
    ((double) (ds_end_mfcc  - ds_start_time)) / CLOCKS_PER_SEC;
  res->cpu_time_infer =
    ((double) (ds_end_infer - ds_end_mfcc))   / CLOCKS_PER_SEC;

  return res;
}


int soxReadAudioFile(char* filename,  char* &buffer,  size_t &buffer_size, int &sampleRate) {

  sox_format_t* input = sox_open_read(filename, NULL, NULL, NULL);
  assert(input);

  sampleRate = (int)input->signal.rate;

  // Resample/reformat the audio so we can pass it through the MFCC functions
  sox_signalinfo_t target_signal = {
      SOX_UNSPEC, // Rate
      1, // Channels
      16, // Precision
      SOX_UNSPEC, // Length
      NULL // Effects headroom multiplier
  };

  sox_encodinginfo_t target_encoding = {
    SOX_ENCODING_SIGN2, // Sample format
    16, // Bits per sample
    0.0, // Compression factor
    sox_option_default, // Should bytes be reversed
    sox_option_default, // Should nibbles be reversed
    sox_option_default, // Should bits be reversed (pairs of bits?)
    sox_false // Reverse endianness
  };

#ifdef __APPLE__
  // It would be preferable to use sox_open_memstream_write here, but OS-X
  // doesn't support POSIX 2008, which it requires. See Issue #461.
  // Instead, we write to a temporary file.
  char* output_name = tmpnam(NULL);
  assert(output_name);
  sox_format_t* output = sox_open_write(output_name, &target_signal,
                                        &target_encoding, "raw", NULL, NULL);
#else
 
  sox_format_t* output = sox_open_memstream_write(&buffer, &buffer_size,
                                                  &target_signal,
                                                  &target_encoding,
                                                  "raw", NULL);
#endif

  assert(output);

  // Setup the effects chain to decode/resample
  char* sox_args[10];
  sox_effects_chain_t* chain =
    sox_create_effects_chain(&input->encoding, &output->encoding);

  sox_effect_t* e = sox_create_effect(sox_find_effect("input"));
  sox_args[0] = (char*)input;
  assert(sox_effect_options(e, 1, sox_args) == SOX_SUCCESS);
  assert(sox_add_effect(chain, e, &input->signal, &input->signal) ==
         SOX_SUCCESS);
  free(e);

  e = sox_create_effect(sox_find_effect("channels"));
  assert(sox_effect_options(e, 0, NULL) == SOX_SUCCESS);
  assert(sox_add_effect(chain, e, &input->signal, &output->signal) ==
         SOX_SUCCESS);
  free(e);

  e = sox_create_effect(sox_find_effect("output"));
  sox_args[0] = (char*)output;
  assert(sox_effect_options(e, 1, sox_args) == SOX_SUCCESS);
  assert(sox_add_effect(chain, e, &input->signal, &output->signal) ==
         SOX_SUCCESS);
  free(e);

  // Finally run the effects chain
  sox_flow_effects(chain, NULL, NULL);
  sox_delete_effects_chain(chain);

#ifdef __APPLE__
  // Capture the output length before closing the handle
  size_t output_length = (size_t)(output->olength * 2);
#endif

  // Close sox handles
  sox_close(output);
  sox_close(input);

#ifdef __APPLE__
  // Assign through the reference parameters instead of shadowing them with
  // locals, otherwise the caller never sees the decoded buffer
  buffer_size = output_length;
  buffer = (char*)malloc(sizeof(char) * buffer_size);
  FILE* output_file = fopen(output_name, "rb");
  assert(fread(buffer, sizeof(char), buffer_size, output_file) == buffer_size);
  fclose(output_file);
  unlink(output_name);
#endif
	
  return 0;
}

int
main(int argc, char **argv)
{
  if (argc < 4 || argc > 7) {
    printf("Usage: deepspeech MODEL_PATH AUDIO_PATH ALPHABET_PATH [LM_PATH] [TRIE_PATH] [-t]\n");
    printf("  MODEL_PATH\tPath to the model (protocol buffer binary file)\n");
    printf("  AUDIO_PATH\tPath to the audio file to run"
           " (any file format supported by libsox)\n");
    printf("  ALPHABET_PATH\tPath to the configuration file specifying"
           " the alphabet used by the network.\n");
    printf("  LM_PATH\tOptional: Path to the language model binary file.\n");
    printf("  TRIE_PATH\tOptional: Path to the language model trie file created with"
           " native_client/generate_trie.\n");
    printf("  -t\t\tRun in benchmark mode, output mfcc & inference time\n");
    return 1;
  }

  // Initialise DeepSpeech
  Model ctx = Model(argv[1], N_CEP, N_CONTEXT, argv[3], BEAM_WIDTH);

  if (argc > 5) {
    ctx.enableDecoderWithLM(argv[3], argv[4], argv[5], LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT);
  }

  // Initialise SOX
  assert(sox_init() == SOX_SUCCESS);

  
  char* buffer;
  size_t buffer_size = 0;
  int sampleRate = 0;
  soxReadAudioFile(argv[2], buffer, buffer_size, sampleRate);


  // Pass audio to DeepSpeech
  // We take half of buffer_size because buffer is a char* while
  // LocalDsSTT() expected a short*
  struct ds_result* result = LocalDsSTT(ctx, (const short*)buffer,
                                        buffer_size / 2, sampleRate);


  if (result) {
    if (result->string) {
      printf("%s\n", result->string);
      free(result->string);
    }

    if (!strncmp(argv[argc-1], "-t", 3)) {
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f "
             "cpu_time_infer=%.05f\n",
             result->cpu_time_overall,
             result->cpu_time_mfcc,
             result->cpu_time_infer);
    
    }
    free(result);

  }

// read 2nd file for testing
 soxReadAudioFile("/home/johsely/recordings/accept01.wav",buffer,buffer_size, sampleRate);

 result = LocalDsSTT(ctx, (const short*)buffer,
                                        buffer_size / 2, sampleRate);

  if (result) {
    if (result->string) {
      printf("%s\n", result->string);
      free(result->string);
    }

    if (!strncmp(argv[argc-1], "-t", 3)) {
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f "
             "cpu_time_infer=%.05f\n",
             result->cpu_time_overall,
             result->cpu_time_mfcc,
             result->cpu_time_infer);
    }

    free(result);
  }

// read 3rd file for testing
 soxReadAudioFile("/home/johsely/recordings/test01.wav",buffer,buffer_size, sampleRate);

 result = LocalDsSTT(ctx, (const short*)buffer,
                                        buffer_size / 2, sampleRate);

  if (result) {
    if (result->string) {
      printf("%s\n", result->string);
      free(result->string);
    }

    if (!strncmp(argv[argc-1], "-t", 3)) {
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f "
             "cpu_time_infer=%.05f\n",
             result->cpu_time_overall,
             result->cpu_time_mfcc,
             result->cpu_time_infer);
    }

    free(result);
  }

// read 4th file for testing
 soxReadAudioFile("/home/johsely/recordings/check01.wav",buffer,buffer_size, sampleRate);

 result = LocalDsSTT(ctx, (const short*)buffer,
                                        buffer_size / 2, sampleRate);

  if (result) {
    if (result->string) {
      printf("%s\n", result->string);
      free(result->string);
    }

    if (!strncmp(argv[argc-1], "-t", 3)) {
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f "
             "cpu_time_infer=%.05f\n",
             result->cpu_time_overall,
             result->cpu_time_mfcc,
             result->cpu_time_infer);
    }

    free(result);
  }

  free(buffer);

  // Deinitialise and quit
  sox_quit();

  return 0;
}

(Lissyx) #10

Right, I did that in a slightly cleaner way; here are my results on a GTX 1080:

./deepspeech ~/tmp/deepspeech/models/output_graph.pb ~/tmp/deepspeech/audio/ ~/tmp/deepspeech/models/alphabet.txt -t 2>&1
2018-02-13 13:56:09.560697: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-13 13:56:09.680074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-13 13:56:09.680295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.797
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 7.55GiB
2018-02-13 13:56:09.680308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Running on directory /home/alexandre/tmp/deepspeech/audio/
> /home/alexandre/tmp/deepspeech/audio//2830-3980-0043.wav
experience proves tis
cpu_time_overall=1.89433 cpu_time_mfcc=0.00241 cpu_time_infer=1.89192
> /home/alexandre/tmp/deepspeech/audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=0.47934 cpu_time_mfcc=0.00316 cpu_time_infer=0.47618
> /home/alexandre/tmp/deepspeech/audio//8455-210777-0068.wav
your powr is sufficient i said
cpu_time_overall=0.47399 cpu_time_mfcc=0.00317 cpu_time_infer=0.47081

(Lissyx) #11

It would be great if you could reproduce my steps (using the v0.1.1 model and audio files, so we compare like with like) from my branch: https://github.com/lissyx/DeepSpeech/tree/nc-multi-bench

Instead of passing one WAV file as a parameter, just pass a directory containing several - in our case, the audio directory extracted from https://github.com/mozilla/DeepSpeech/releases/download/v0.1.1/audio-0.1.1.tar.gz
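
The loop is roughly of this shape (just a sketch, not the exact code from the branch; it assumes the Model, soxReadAudioFile() and LocalDsSTT() helpers from the client posted above, plus <dirent.h> and <string>):

// Sketch only: walk a directory and benchmark every .wav file in it,
// reusing the same Model instance so only the first file pays the
// one-time startup cost.
void BenchmarkDirectory(Model& ctx, const char* dirPath)
{
  DIR* dir = opendir(dirPath);
  if (!dir) {
    fprintf(stderr, "Cannot open directory %s\n", dirPath);
    return;
  }

  struct dirent* entry;
  while ((entry = readdir(dir)) != NULL) {
    std::string name(entry->d_name);
    // Only benchmark WAV files
    if (name.size() < 4 || name.compare(name.size() - 4, 4, ".wav") != 0) {
      continue;
    }
    std::string path = std::string(dirPath) + "/" + name;
    printf("> %s\n", path.c_str());

    char* buffer = NULL;
    size_t buffer_size = 0;
    int sampleRate = 0;
    soxReadAudioFile((char*)path.c_str(), buffer, buffer_size, sampleRate);

    struct ds_result* result = LocalDsSTT(ctx, (const short*)buffer,
                                          buffer_size / 2, sampleRate);
    if (result) {
      if (result->string) {
        printf("%s\n", result->string);
        free(result->string);
      }
      printf("cpu_time_overall=%.05f cpu_time_mfcc=%.05f cpu_time_infer=%.05f\n",
             result->cpu_time_overall, result->cpu_time_mfcc,
             result->cpu_time_infer);
      free(result);
    }
    free(buffer);
  }
  closedir(dir);
}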


(Nastasev) #12

I came back with a new set of tests on new hardware (Intel® Core™ i7-6600U CPU @ 2.60GHz × 4, 16GB RAM, Ubuntu 16.04 LTS), TensorFlow 1.4.1, DeepSpeech pre-trained model v0.1.1.

8.5 sec of audio -> inference time 12.8 sec (using the deepspeech.cc below - look for VIRGIL12FEB2018, especially the session inter_op_parallelism_threads setting; without those changes inference takes 19 sec).

I hope my reply is not too chaotic :slight_smile: - I didn’t know how to attach code.

#ifdef DS_NATIVE_MODEL
#define EIGEN_USE_THREADS
#define EIGEN_USE_CUSTOM_THREAD_POOL

#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include "native_client/deepspeech_model_core.h" // generated
#endif

#include <iostream>

#include "deepspeech.h"
#include "deepspeech_utils.h"
#include "alphabet.h"
#include "beam_search.h"

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"

#define BATCH_SIZE 1

using namespace tensorflow;
using tensorflow::ctc::CTCBeamSearchDecoder;
using tensorflow::ctc::CTCDecoder;

namespace DeepSpeech {

class Private {
public:
Session* session;
GraphDef graph_def;
int ncep;
int ncontext;
Alphabet* alphabet;
KenLMBeamScorer* scorer;
int beam_width;
bool run_aot;
};

DEEPSPEECH_EXPORT
Model::Model(const char* aModelPath, int aNCep, int aNContext,
const char* aAlphabetConfigPath, int aBeamWidth)
{
mPriv = new Private;
mPriv->session = NULL;
mPriv->scorer = NULL;
mPriv->ncep = aNCep;
mPriv->ncontext = aNContext;
mPriv->alphabet = new Alphabet(aAlphabetConfigPath);
mPriv->beam_width = aBeamWidth;
mPriv->run_aot = false;

if (!aModelPath || strlen(aModelPath) < 1) {
std::cerr << "No model specified, will rely on built-in model." << std::endl;
mPriv->run_aot = true;
return;
}

//VIRGIL12FEB2018
tensorflow::SessionOptions options;
tensorflow::ConfigProto & config = options.config;
config.set_inter_op_parallelism_threads(1);
config.set_intra_op_parallelism_threads(2);
config.set_use_per_session_threads(false);
Status status = NewSession(options, &mPriv->session);
//VIRGIL12FEB2018

//Status status = NewSession(SessionOptions(), &mPriv->session);
if (!status.ok()) {
std::cerr << status.ToString() << std::endl;
return;
}

status = ReadBinaryProto(Env::Default(), aModelPath, &mPriv->graph_def);
if (!status.ok()) {
mPriv->session->Close();
mPriv->session = NULL;
std::cerr << status.ToString() << std::endl;
return;
}

status = mPriv->session->Create(mPriv->graph_def);
if (!status.ok()) {
mPriv->session->Close();
mPriv->session = NULL;
std::cerr << status.ToString() << std::endl;
return;
}

for (int i = 0; i < mPriv->graph_def.node_size(); ++i) {
NodeDef node = mPriv->graph_def.node(i);
if (node.name() == "logits/shape/2") {
int final_dim_size = node.attr().at("value").tensor().int_val(0) - 1;
if (final_dim_size != mPriv->alphabet->GetSize()) {
std::cerr << "Error: Alphabet size does not match loaded model: alphabet "
<< "has size " << mPriv->alphabet->GetSize()
<< ", but model has " << final_dim_size
<< " classes in its output. Make sure you're passing an alphabet "
<< "file with the same size as the one used for training."
<< std::endl;
mPriv->session->Close();
mPriv->session = NULL;
return;
}
break;
}
}
}

DEEPSPEECH_EXPORT
Model::~Model()
{
if (mPriv->session) {
mPriv->session->Close();
}

delete mPriv->alphabet;
delete mPriv->scorer;

delete mPriv;
}

DEEPSPEECH_EXPORT
void
Model::enableDecoderWithLM(const char* aAlphabetConfigPath, const char* aLMPath,
const char* aTriePath, float aLMWeight,
float aWordCountWeight, float aValidWordCountWeight)
{
mPriv->scorer = new KenLMBeamScorer(aLMPath, aTriePath, aAlphabetConfigPath,
aLMWeight, aWordCountWeight, aValidWordCountWeight);
}

DEEPSPEECH_EXPORT
void
Model::getInputVector(const short* aBuffer, unsigned int aBufferSize,
int aSampleRate, float** aMfcc, int* aNFrames,
int* aFrameLen)
{
return audioToInputVector(aBuffer, aBufferSize, aSampleRate, mPriv->ncep,
mPriv->ncontext, aMfcc, aNFrames, aFrameLen);
}

char*
Model::decode(int aNFrames, float*** aLogits)
{
const int batch_size = BATCH_SIZE;
const int top_paths = 1;
const int timesteps = aNFrames;
const size_t num_classes = mPriv->alphabet->GetSize() + 1; // +1 for blank

// Raw data containers (arrays of floats, ints, etc.).
int sequence_lengths[batch_size] = {timesteps};

// Convert data containers to the format accepted by the decoder, simply
// mapping the memory from the container to an Eigen::ArrayXi,::MatrixXf,
// using Eigen::Map.
Eigen::Map<const Eigen::ArrayXi> seq_len(&sequence_lengths[0], batch_size);
std::vector<Eigen::Map<const Eigen::MatrixXf>> inputs;
inputs.reserve(timesteps);
for (int t = 0; t < timesteps; ++t) {
inputs.emplace_back(&aLogits[t][0][0], batch_size, num_classes);
}

// Prepare containers for output and scores.
// CTCDecoder::Output is std::vector<std::vector<int>>
std::vector<CTCDecoder::Output> decoder_outputs(top_paths);
for (CTCDecoder::Output& output : decoder_outputs) {
output.resize(batch_size);
}
float score[batch_size][top_paths] = {{0.0}};
Eigen::Map<Eigen::MatrixXf> scores(&score[0][0], batch_size, top_paths);

if (mPriv->scorer == NULL) {
CTCBeamSearchDecoder<>::DefaultBeamScorer scorer;
CTCBeamSearchDecoder<> decoder(num_classes,
mPriv->beam_width,
&scorer,
batch_size);
decoder.Decode(seq_len, inputs, &decoder_outputs, &scores).ok();
} else {
CTCBeamSearchDecoder<KenLMBeamState> decoder(num_classes,
mPriv->beam_width,
mPriv->scorer,
batch_size);
decoder.Decode(seq_len, inputs, &decoder_outputs, &scores).ok();
}

// Output is an array of shape (1, n_results, result_length).
// In this case, n_results is also equal to 1.
size_t output_length = decoder_outputs[0][0].size() + 1;

size_t decoded_length = 1; // add 1 for the \0
for (int i = 0; i < output_length - 1; i++) {
int64 character = decoder_outputs[0][0][i];
const std::string& str = mPriv->alphabet->StringFromLabel(character);
decoded_length += str.size();
}

char* output = (char*)malloc(sizeof(char) * decoded_length);
char* pen = output;
for (int i = 0; i < output_length - 1; i++) {
int64 character = decoder_outputs[0][0][i];
const std::string& str = mPriv->alphabet->StringFromLabel(character);
strncpy(pen, str.c_str(), str.size());
pen += str.size();
}
*pen = '\0';

for (int i = 0; i < timesteps; ++i) {
for (int j = 0; j < batch_size; ++j) {
free(aLogits[i][j]);
}
free(aLogits[i]);
}
free(aLogits);

return output;
}

DEEPSPEECH_EXPORT
char*
Model::infer(float* aMfcc, int aNFrames, int aFrameLen)
{
const int batch_size = BATCH_SIZE;
const int timesteps = aNFrames;
const size_t num_classes = mPriv->alphabet->GetSize() + 1; // +1 for blank

const int frameSize = mPriv->ncep + (2 * mPriv->ncep * mPriv->ncontext);

float*** input_data_mat = (float***)calloc(timesteps, sizeof(float**));
for (int i = 0; i < timesteps; ++i) {
input_data_mat[i] = (float**)calloc(batch_size, sizeof(float*));
for (int j = 0; j < batch_size; ++j) {
input_data_mat[i][j] = (float*)calloc(num_classes, sizeof(float));
}
}

if (mPriv->run_aot) {
#ifdef DS_NATIVE_MODEL
Eigen::ThreadPool tp(2); // Size the thread pool as appropriate.
Eigen::ThreadPoolDevice device(&tp, tp.NumThreads());

nativeModel nm(nativeModel::AllocMode::RESULTS_AND_TEMPS_ONLY);
nm.set_thread_pool(&device);

for (int ot = 0; ot < timesteps; ot += DS_MODEL_TIMESTEPS) {
  nm.set_arg0_data(&(aMfcc[ot * frameSize]));
  nm.Run();

  // The CTCDecoder works with log-probs.
  for (int t = 0; t < DS_MODEL_TIMESTEPS && (ot + t) < timesteps; ++t) {
    for (int b = 0; b < batch_size; ++b) {
      for (int c = 0; c < num_classes; ++c) {
        input_data_mat[ot + t][b][c] = nm.result0(t, b, c);
      }
    }
  }
}

#else
std::cerr << "No support for native model built-in." << std::endl;
return NULL;
#endif // DS_NATIVE_MODEL
} else {
if (aFrameLen == 0) {
aFrameLen = frameSize;
} else if (aFrameLen < frameSize) {
std::cerr << "mfcc features array is too small (expected " <<
frameSize << ", got " << aFrameLen << ")\n";
return NULL;
}

Tensor input(DT_FLOAT, TensorShape({1, aNFrames, frameSize}));

auto input_mapped = input.tensor<float, 3>();
for (int i = 0, idx = 0; i < aNFrames; i++) {
  for (int j = 0; j < frameSize; j++, idx++) {
    input_mapped(0, i, j) = aMfcc[idx];
  }
  idx += (aFrameLen - frameSize);
}

Tensor n_frames(DT_INT32, TensorShape({1}));
n_frames.scalar<int>()() = aNFrames;

//VIRGIL12FEB2018
clock_t begin = clock();
std::cout << "------>Start mPriv->session->Run\n";
//VIRGIL12FEB2018

// The CTC Beam Search decoder takes logits as input, we can feed those from
// the "logits" node in official models or
// the "logits_output_node" in old AOT hacking models
std::vector<Tensor> outputs;
Status status = mPriv->session->Run(
  {{ "input_node", input }, { "input_lengths", n_frames }},
  {"logits"}, {}, &outputs);

//VIRGIL12FEB2018
std::cout << "------>Done mPriv->session->Run (decoding goes next): " << double(clock() - begin)  / CLOCKS_PER_SEC << "\n";
begin = clock();
//VIRGIL12FEB2018

// If "logits" doesn't exist, this is an older graph. Try to recover.
if (status.code() == tensorflow::error::NOT_FOUND) {
  status.IgnoreError();
  status = mPriv->session->Run(
    {{ "input_node", input }, { "input_lengths", n_frames }},
    {"logits_output_node"}, {}, &outputs);
}

if (!status.ok()) {
  std::cerr << "Error running session: " << status.ToString() << "\n";
  return NULL;
}

auto logits_mapped = outputs[0].tensor<float, 3>();
// The CTCDecoder works with log-probs.
for (int t = 0; t < timesteps; ++t) {
  for (int b = 0; b < batch_size; ++b) {
    for (int c = 0; c < num_classes; ++c) {
      input_data_mat[t][b][c] = logits_mapped(t, b, c);
    }
  }
}
//VIRGIL12FEB2018
std::cout << "------>Done decoding: " << double(clock() - begin)  / CLOCKS_PER_SEC << "\n";
//VIRGIL12FEB2018

}

return decode(aNFrames, input_data_mat);
}

DEEPSPEECH_EXPORT
char*
Model::stt(const short* aBuffer, unsigned int aBufferSize, int aSampleRate)
{
float* mfcc;
char* string;
int n_frames;

getInputVector(aBuffer, aBufferSize, aSampleRate, &mfcc, &n_frames, NULL);
string = infer(mfcc, n_frames);
free(mfcc);
return string;
}

}


(Lissyx) #13

Your result now seems more on par with what we experience. I’m surprised you had to play with session options to trigger threads; I do see threads working here.


(JohSely) #14

@lissyx Maybe I can, if I have time.
But I think we both showed that the second and following inferences run way faster than the first one. At least fast enough for me now :slight_smile:
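
Roughly, for a live setup the idea would be something like this (just a sketch, not tested; it reuses the Model/stt() API from deepspeech.h and the constants from client.cc, and the half-second silence buffer is an arbitrary warm-up choice):

#include <stdlib.h>
#include <vector>
#include "deepspeech.h"

#define N_CEP 26
#define N_CONTEXT 9
#define BEAM_WIDTH 500

using namespace DeepSpeech;

int main(int argc, char** argv)
{
  // argv[1] = model path, argv[2] = alphabet path (hypothetical CLI)
  Model ctx = Model(argv[1], N_CEP, N_CONTEXT, argv[2], BEAM_WIDTH);

  // Warm-up: run one throw-away inference on half a second of silence so
  // the expensive first session->Run happens before any real audio arrives.
  const int sampleRate = 16000;
  std::vector<short> silence(sampleRate / 2, 0);
  char* warmup = ctx.stt(silence.data(), silence.size(), sampleRate);
  if (warmup) {
    free(warmup);
  }

  // From here on, each captured audio chunk passed to
  // ctx.stt(buffer, bufferSize, sampleRate) should hit the ~0.5 s
  // steady-state latency instead of the multi-second first run.
  return 0;
}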


(Lissyx) #15

Yeah, thanks for noticing, I would not have bet on such a difference.


(Nastasev) #16

I confirm Johannes’ observation - the first inference is slower, the next ones run faster.
I did a brief live STT test on the machine with the specs above (using a custom auditok setup) and it works quite satisfactorily (speed, accuracy vs. noise). I would say decently close to Google offline :-)


(Lissyx) #17

I’ve had a closer look, and as I expected it’s mostly coming from the loading of the protocol buffer file. Switching to TensorFlow’s MemmappedEnv loading helps.
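
Roughly, the loading changes into something like this (sketch only, the actual patch may differ; it assumes the graph was first converted with TensorFlow’s convert_graphdef_memmapped_format tool so it can be mmap’d):

#include <memory>

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/util/memmapped_file_system.h"

using namespace tensorflow;

// Sketch: load a memmapped graph and create a session bound to the
// memmapped environment instead of deserializing the whole protocol buffer.
Status LoadMemmappedModel(const char* aModelPath,
                          std::unique_ptr<MemmappedEnv>& aMmapEnv,
                          GraphDef* aGraphDef,
                          Session** aSession)
{
  aMmapEnv.reset(new MemmappedEnv(Env::Default()));
  Status status = aMmapEnv->InitializeFromFile(aModelPath);
  if (!status.ok()) {
    return status;
  }

  // The session must use the memmapped environment so the weights are
  // read straight from the mapped file.
  SessionOptions options;
  options.env = aMmapEnv.get();
  status = NewSession(options, aSession);
  if (!status.ok()) {
    return status;
  }

  status = ReadBinaryProto(aMmapEnv.get(),
                           MemmappedFileSystem::kMemmappedPackageDefaultGraphDef,
                           aGraphDef);
  if (!status.ok()) {
    return status;
  }

  return (*aSession)->Create(*aGraphDef);
}

The memmapped environment just has to be kept alive for as long as the session is.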