Guide: Using Automatic Mixed Precision for NVIDIA Tensor Cores

Hello, this is a quick guide to deploying the NVIDIA Docker container and taking advantage of NVIDIA Tensor Cores without changing the code.

Before deploying the optimized container, you should read:

The requirements are almost the same as for a normal DeepSpeech training deployment; we need three extra things:

Requirements:

  1. A GPU with Tensor Cores. Check whether your GPU has them before you start.

  2. The NVIDIA NGC TensorFlow 19.03 container

  3. Docker. Installation depends on your platform, so I'm not covering it in this guide.

Installing the container:

Before installing the container, you only need to clone the DeepSpeech repo and remove the tensorflow requirement from requirements.txt.
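If you'd rather not edit the file by hand, here is a minimal sketch of that removal step in Python. The helper name strip_tensorflow is hypothetical, the sample file contents are made up for illustration, and I'm assuming the dependency is listed as either tensorflow or tensorflow-gpu; check your own requirements.txt before relying on it.

```python
import re

# Package names to drop (assumption: the dependency appears under one of these).
TENSORFLOW_NAMES = {"tensorflow", "tensorflow-gpu"}

def strip_tensorflow(requirements_text):
    """Return the requirements text without the TensorFlow lines."""
    kept = []
    for line in requirements_text.splitlines():
        # The package name is whatever comes before a version specifier,
        # an extras bracket, an environment marker, or a space.
        name = re.split(r"[=<>!~\[; ]", line.strip(), maxsplit=1)[0].lower()
        if name in TENSORFLOW_NAMES:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"

# Made-up sample contents, for illustration only.
sample = "numpy\ntensorflow == 1.13.1\npandas\n"
cleaned = strip_tensorflow(sample)
print(cleaned, end="")
```

To apply it to the real file, read requirements.txt into a string, pass it through the function, and write the result back.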

Next, we pull the container by running:

docker pull

Now we set up our workspace inside the downloaded container. To start the container, run:

sudo nvidia-docker run -it --rm -v $HOME:$HOME

Notice that I mounted my current home directory; this lets me use my existing paths inside the container. If you don’t want to mirror your home directory, you can mount any other directory instead:

sudo nvidia-docker run -it --rm -v /user/home/deepspeech:/deepspeech

In my case I was using a cloud instance with an extra mounted disk. If you need to expose another path to the container, as I did, just add an extra -v option.

Now we need to install the requirements:

Again, make sure you have removed the TensorFlow dependency from requirements.txt; the container already ships an optimized version of TensorFlow that is fully compatible with DeepSpeech.

Inside the container, from your cloned DeepSpeech repo, run:

pip3 install -r requirements.txt

We need the decoder too:

pip3 install $(python3 util/ --decoder)

You will probably hit an issue related to pandas and Python 3.5.

To fix it, run:

python3 -m pip install --upgrade pandas

Notice that we don’t need to use a virtual environment.

Finally, we need to enable automatic mixed precision by setting the TF_ENABLE_AUTO_MIXED_PRECISION environment variable before starting training.
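A minimal sketch of setting the TF_ENABLE_AUTO_MIXED_PRECISION flag (the variable mentioned later in this thread) from Python. The value "1" follows NVIDIA's documentation for these containers as far as I know; the helper name is hypothetical. The flag has to be in the environment before TensorFlow builds its graph, so set it before importing TensorFlow.

```python
import os

def enable_auto_mixed_precision(env=None):
    """Set the AMP flag in `env` (defaults to os.environ) and return the mapping."""
    env = os.environ if env is None else env
    # Assumption: "1" is the value that turns the feature on.
    env["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
    return env

enable_auto_mixed_precision()
print(os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"])

# Import TensorFlow (or launch training) only after this point.
```

In a shell, the equivalent is to export TF_ENABLE_AUTO_MIXED_PRECISION=1 before running the training command.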


To check whether your GPU is using Tensor Cores, you can put nvprof in front of your command, something like:

nvprof python --the rest of your params

You will then get a log of the kernels that were executed. To find out whether the Tensor Cores were used, search it for: s884gemm_fp16
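If the log is long, a small script can do the searching. This is only a sketch: it assumes you have saved the nvprof output as text, the function name is hypothetical, and it matches on the shorter substring "s884" taken from the kernel name above. The sample lines are made up and are not real nvprof output.

```python
def find_kernel_lines(log_text, marker="s884"):
    """Return the stripped log lines that contain `marker`."""
    return [line.strip() for line in log_text.splitlines() if marker in line]

# Made-up sample for illustration only; this is NOT real nvprof output.
sample_log = "\n".join([
    "example_kernel_s884gemm_fp16",
    "example_kernel_sgemm",
])

hits = find_kernel_lines(sample_log)
if hits:
    print("Tensor Core kernels found:")
    for line in hits:
        print("  " + line)
else:
    print("No Tensor Core kernels found")
```

An empty result on a real log suggests the run stayed on the regular fp32 kernels.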

My results from a small test with 300h of data on a single V100 GPU:

Type                                    Time      WER       Epochs
Normal training (fp32)                  2:27:54   0.084087  10
Auto mixed precision training (fp16)    1:39:03   0.091663  10
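To make the trade-off in those numbers explicit, here is a short sketch that works out the speedup and the relative WER change from the figures in the table. Nothing here is new data; the variable names are just for illustration.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Figures copied from the table above.
fp32_time, fp32_wer = to_seconds("2:27:54"), 0.084087
amp_time, amp_wer = to_seconds("1:39:03"), 0.091663

speedup = fp32_time / amp_time                  # how many times faster
time_saved = 1 - amp_time / fp32_time           # fraction of wall time saved
wer_increase = (amp_wer - fp32_wer) / fp32_wer  # relative WER change

print(f"speedup: {speedup:.2f}x")
print(f"training time saved: {time_saved:.1%}")
print(f"relative WER increase: {wer_increase:.1%}")
```

So on this test, mixed precision trained about 1.5x faster (roughly a third less time) at the cost of a relative WER increase of about 9%.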

Unfortunately, I can’t run a larger test.

This is a potential PR, please feel free to suggest any changes and share insights if you use the container.

Nice guide, thanks! One question: in your table, when you refer to normal training, is it using TensorRT or something different?

Hello @lissyx

I think TensorRT is only for inference, and I meant the fp32 training. Thanks, I’ve edited the post to make it clear.

Would it be possible to use this new feature on an RTX card without retraining the model by simply enabling mixed precision in the newer docker container image?


Alternatively, would it work to load the existing pre-trained weights, run a few passes over the data in this mode just to adjust the weights to the lower fp16 precision, and then export that new model?

@carlfm01 that’s awesome. Do you have any idea on potential benchmark numbers for inference?

@lissyx On a higher level, what is the bottleneck for inference time in the current DeepSpeech implementation? I know it’s an RNN, so I assume it’s less parallelizable than other models like image-based CNNs.

Here are the numbers I get for a 4 second test file using DeepSpeech 0.3:

CPU: AMD EPYC 16 core
CARD: RTX 2080 in 655ms
CARD: GTX 1080 in 700ms
CARD: GTX 680 in 1200ms
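To put those timings side by side, here is a quick sketch that expresses each one as a fraction of the 4 second clip and as a speedup over the GTX 1080. It only uses the numbers listed above; the names are for illustration.

```python
AUDIO_MS = 4000  # length of the test file, in milliseconds

# Inference times copied from the list above, in milliseconds.
inference_ms = {"RTX 2080": 655, "GTX 1080": 700, "GTX 680": 1200}

def relative_speed(baseline_ms, other_ms):
    """How many times faster `other_ms` is than `baseline_ms`."""
    return baseline_ms / other_ms

for card, ms in inference_ms.items():
    fraction = ms / AUDIO_MS
    vs_1080 = relative_speed(inference_ms["GTX 1080"], ms)
    print(f"{card}: {fraction:.1%} of audio length, {vs_1080:.2f}x vs GTX 1080")
```

By that measure the RTX 2080 comes out only about 7% faster than the GTX 1080 on this file.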

Comparing the RTX 2080 and the GTX 1080, it seems that the GPU architecture isn’t making a huge difference.

It’s certainly not using the Tensor Cores.

What is the best way to get these numbers down?
Try the above Mixed Precision mode on the RTX?
Would the TensorFlow Lite FP16 model be able to take advantage of the Tensor Cores as well?

Keen to hear any ideas on getting inference speed down.

Since it requires special handling, it’s not surprising

Again, for inference, we know GPUs are faster, but we also know any decent GPU will give good performance, so no huge improvement is expected. Your 680/1080/2080 comparison is likely influenced more by clock speed and memory bandwidth than by the GPU’s architecture.

At some point, only batching more inferences together will likely help much more. In the end it all depends on what you want to achieve.

No idea. Getting real-time performance across devices for more interactive use cases is more of a priority for us. We welcome anything that improves other use cases, of course.

@lissyx thanks for the quick response.

@carlfm01 did you get any numbers on inference time with this TF_ENABLE_AUTO_MIXED_PRECISION mode?

Sorry, overheating here, I missed Lite: GPU delegation on TFLite is still in early stages, so I don’t think upstream even cares about that …

It should work. I’m doing the opposite: training first in fp16 and then fine-tuning in fp32. When you train in fp16, the entire saved model will still be fp32.

No, sorry :confused:

You can ask here, maybe there are new updates: