DeepSpeech with TensorRT

Nobody has tried it, or if they have tried it, they didn’t share the results with us.

Okay, but is it really possible? I read that some parts of the DeepSpeech architecture are not yet supported by TensorRT.

I don’t know. I imagine I would find out if I tried doing it :slight_smile:

I am actually attempting this right now. From what I have read online, you need to freeze the graph after running the checkpoints and then run TensorRT on it. Have you had any luck?
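For reference, freezing a TF 1.x checkpoint into a .pb is usually done with TensorFlow's freeze_graph tool. A rough sketch, where the paths and the output node name are placeholders, not taken from this thread:

```shell
python -m tensorflow.python.tools.freeze_graph \
  --input_graph=model.pbtxt \
  --input_checkpoint=model.ckpt \
  --output_graph=output_graph.pb \
  --output_node_names=logits
```

(The pretrained DeepSpeech release already ships a frozen output_graph.pb, so this step can be skipped if you use that.)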

Nobody on the team has had time to seriously dig into that, unfortunately.

Ok, if I am able to make it work how would I be able to share these results? Could I merge it on Github?

Well, I don’t see any reason this cannot be sent as a PR. Merging will, of course, also depend on the changes themselves, but we have nothing against it.

Ok, so an update. I was able to use the frozen graph (output_graph.pb from the pretrained model) to create the TensorRT file and attempted to run inference. However, two things:

  1. The TensorRT file (called trt_output_graph.pb) was only a few bytes larger than output_graph.pb, so there was barely any difference in size.
  2. The inference itself was SIGNIFICANTLY slower. It did not speed things up in the slightest.

That said, I read an earlier post where someone attempted to convert the model to a .uff file and run it with DeepSpeech. Can this be done, i.e. using a .uff file instead of a .pb (and .pbmm)?

That really feels weird, isn’t it supposed to be exactly the opposite?

Yes, it is. But upon further investigation, I found that TensorRT does not support the AudioSpectrogram op, the MFCC op, or NoOp (though the last is technically irrelevant), so it could not optimize them.
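For anyone who wants to verify this themselves, one way is to load the converted .pb and count node types: ops that TensorRT converted get folded into TRTEngineOp nodes, while unsupported ones keep their original type. A minimal sketch, assuming TF 1.x and an example path:

```python
import collections

def op_histogram(graph_def):
    # Count how many nodes of each op type the graph contains.
    # Ops TensorRT converted are folded into TRTEngineOp nodes;
    # unsupported ones (e.g. AudioSpectrogram, Mfcc) keep their type.
    return collections.Counter(node.op for node in graph_def.node)

def load_frozen_graph(path):
    # TensorFlow is imported lazily so op_histogram stays usable
    # without it (TF 1.x API: tf.GraphDef / tf.gfile).
    import tensorflow as tf
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, "rb") as f:
        graph_def.ParseFromString(f.read())
    return graph_def

# Example (path is illustrative):
# for op, n in op_histogram(load_frozen_graph("trt_output_graph.pb")).most_common():
#     print(op, n)
```

If the histogram still shows AudioSpectrogram and Mfcc nodes next to TRTEngineOp, those are the segments TensorRT left untouched.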

Right, but they should not slow down. Can you share more insight on the whole process? Nobody on the team has studied TensorRT carefully.

How is the graph run when you do that? Using our libdeepspeech.so? Using something else?

Ok, so the first thing I did was carefully read the NVIDIA documentation (the developer guide, the sample guide, etc.).

In a nutshell, the developer guide provided by NVIDIA shows which operations (ops) TensorRT has simplified, and the sample code provides an example of how to go about it. When I say simplified, I mean improved or optimized, which lowers latency and makes inference run faster. After reading the documentation, you will realize that running TensorRT involves two steps:

  1. Convert the original model (via a few different ways) into a TensorRT model
  2. Run the inference

Here comes the confusing part: there’s an easy way to do this, and a slightly more challenging (but more powerful) way.

The easy way:

from tensorflow.contrib import tensorrt as trt  # TF 1.x
import tensorflow as tf

# graph_def: the frozen GraphDef (output_graph.pb) loaded beforehand
trt_graph = trt.create_inference_graph(
        input_graph_def=graph_def,  # frozen model
        outputs=['logits'],
        max_batch_size=512,  # specify your max batch size
        max_workspace_size_bytes=2 * (10 ** 9),  # specify the max workspace
        precision_mode="FP32")  # precision, can be "FP32" or "FP16"

# write the TensorRT model to be used later for inference
with tf.gfile.GFile("/home/me/trt_output_graph.pb", 'wb') as g:
    g.write(trt_graph.SerializeToString())

The hard way involves converting your frozen model into a UFF file and then writing your own separate inference engine to run it. Documentation for that is provided in the TensorRT examples.
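For the record, the UFF conversion step of the hard way looks roughly like this. `uff` is NVIDIA's converter package that ships with TensorRT, and the paths and output node name below are assumptions for illustration, not verified against the DeepSpeech graph:

```python
def frozen_pb_to_uff(pb_path, output_nodes, uff_path):
    # Requires NVIDIA's `uff` package (bundled with TensorRT);
    # imported lazily so this module loads without it installed.
    import uff
    uff.from_tensorflow_frozen_model(pb_path, output_nodes,
                                     output_filename=uff_path)

# Example (hypothetical paths/node name):
# frozen_pb_to_uff("output_graph.pb", ["logits"], "output_graph.uff")
```

You would then feed the .uff file to a hand-written TensorRT engine, as shown in the TensorRT samples.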

I chose the easy way because I am not familiar with the ins and outs of the inference in the main DeepSpeech.py, and because I have used it before (it works!). I understand how DeepSpeech works, but I am not sure that implementing my own inference engine for DeepSpeech is the optimal solution here. Hence, even though I chose the easier route, I did not perform any inference on my TensorRT model after converting it from the frozen graph. Instead, after converting it, I ran the convert_graphdef_memmapped_format command and then ran deepspeech. I did not use libdeepspeech.so. Is that on the main site?
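The memmapped conversion mentioned above is done with TensorFlow's convert_graphdef_memmapped_format tool; roughly, with example paths:

```shell
convert_graphdef_memmapped_format \
  --in_graph=trt_output_graph.pb \
  --out_graph=trt_output_graph.pbmm
```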

Any information would be greatly appreciated. I would love to get this working and put in a pull request.

NOTE: TensorRT requires specific libraries to be installed. I have the newest version, TensorRT 5.1.5, which requires CUDA 10.0, cuDNN 7.5.0, and TensorFlow 1.13.1.

Yes, when you use the bindings or the C++ client, you use libdeepspeech.so

Well at that point I’m a bit helpless because I have no idea what might be wrong :slight_smile:.

Well, that’s likely not complicated: we already have TensorFlow and TFLite support in libdeepspeech.so. You can have a look at native_client/deepspeech.cc and a few other linked files.

One thing that pops into my mind, and that you have not clearly replied to (you just document how to do the conversion), is: how do you run the new model? Since TensorRT has to be enabled at build time, I’m wondering whether you have a TensorRT-compatible model that is run without a TensorRT-compatible libdeepspeech.so, so it’s kind of “emulated”.

I’m sorry, I don’t mean to be rude, but I am a little confused by what you mean. I did not train the model from scratch. I used your frozen pretrained model, converted it to TensorRT, and then ran it with this command—

deepspeech --model deepspeech-0.5.0-models/trt_output_graph.pbmm --alphabet deepspeech-0.5.0-models/alphabet.txt --lm deepspeech-0.5.0-models/lm.binary --trie deepspeech-0.5.0-models/trie --audio audio/8455-210777-0068.wav 

I got back the following error:

Not found: Op type not registered 'TRTEngineOp' in binary running on Minerva. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
Traceback (most recent call last):
  File "/home/akhil/.local/bin/deepspeech", line 11, in <module>
    sys.exit(main())
  File "/home/akhil/.local/lib/python3.6/site-packages/deepspeech/client.py", line 89, in main
    ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
  File "/home/akhil/.local/lib/python3.6/site-packages/deepspeech/__init__.py", line 24, in __init__
    raise RuntimeError("CreateModel failed with error code {}".format(status))
RuntimeError: CreateModel failed with error code 12294

Also, I did not use Bazel to build TensorFlow; I used pip3.

No problem, your command line is exactly what I was missing!

So, a few things:

  • I don’t know if tensorflow-gpu provided by upstream is TensorRT-enabled by default
  • You will have to rebuild libdeepspeech.so following the CUDA instructions and enabling TensorRT yourself
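The rebuild would roughly follow this shape. The environment variable names come from TensorFlow's configure script; the exact Bazel target and checkout layout for DeepSpeech's native client may differ by version, so treat this as a sketch rather than verified build instructions:

```shell
# In the TensorFlow checkout used to build DeepSpeech's native_client:
export TF_NEED_CUDA=1
export TF_NEED_TENSORRT=1
./configure
bazel build --config=cuda //native_client:libdeepspeech.so
```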

Ok, thank you.

Do you think another option would be to train the model from scratch with TensorRT enabled and see what happens?

Also, I see references to batch size in your conversion code. I don’t know how TensorRT works; is it possible it only gives a speedup on huge workloads, and single-file, single-batch inference just isn’t leveraging it? There might be a first-run / setup cost that is much higher than running the inference itself.
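One generic way to check the setup-cost theory is to time each call separately, so a one-off first-run cost stands out from the steady-state per-call cost. A sketch where the workload is a stand-in, not actual DeepSpeech inference:

```python
import time

def time_calls(fn, n=10):
    # Time each call individually so a one-off setup cost on the
    # first call shows up next to the steady-state per-call cost.
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

# Example with a stand-in workload; replace with a real inference call:
timings = time_calls(lambda: sum(x * x for x in range(10000)), n=5)
first, rest = timings[0], timings[1:]
print("first call: %.6fs, avg of rest: %.6fs" % (first, sum(rest) / len(rest)))
```

If the first call dominates while the rest are fast, the slowdown is engine setup, not per-inference cost.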

I don’t think it is going to make any difference.

You’re right. The documentation supports that. I’m going to debug this some more and get back to you. Thank you for your help.

So maybe you want to try measuring inference time using the evaluate.py codebase, which is closer to an intensive-inference workload.