Hi,
Is there anyone who managed to run Deepspeech Model training on NVIDIA A100 GPUs?
We got a server with 8 NVIDIA A100 40GB CoWoS HBM2 PCIe 4.0-- Passive Cooling, but couldn’t manage to run training with GPU yet.
Hi Betim,
I tried training it on an A100 some time back, but the A100 does not support CUDA 10.0, I guess (I could be wrong, it’s been a while since I last checked :p). If you can get CUDA 10.0 working on it, then it should work.
Maybe NVIDIA provides a TensorFlow r1.15 build for those GPUs?
So we have to uninstall the DS TensorFlow and install NVIDIA’s r1.15?
There is no such thing as DeepSpeech TensorFlow. We rely on upstream TensorFlow. I read somewhere that NVIDIA is providing a TensorFlow r1.15 package for RTX 3xxx, so maybe this could apply to your case as well.
I see that the description matches what we’re talking about at this GitHub link: https://github.com/NVIDIA/tensorflow.
@lissyx can you just check it and give me a hint whether we’re pointing in the same direction?
I’m sorry but if you don’t ask me a clear question, I don’t have time to dig into nvidia’s repos, and I can’t speak for them nor provide support for their work.
@betim And the start of their README is pretty clear to me; it seems to be exactly this use case that is addressed.
@betim I confirm I can train a model on a 3000-series card with NVIDIA’s version; I guess it also works with the A100 since they are Ampere-based GPUs.
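If it helps, installing NVIDIA’s TF1 build from the repo linked above goes through their pip index, roughly like this (double-check their README for the exact Python and OS requirements):

# package names are from the github.com/NVIDIA/tensorflow README; versions resolve via NVIDIA's index
pip install --user nvidia-pyindex
pip install --user nvidia-tensorflow[horovod]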
NVIDIA also offers Docker containers with their TensorFlow build:
https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
Maybe you want to give that a try instead of building TensorFlow on your own.
I also cannot test A100 GPUs at the moment.
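For example, pulling and starting one of their TF1 images could look roughly like this (the tag is only an example from the catalog, and --gpus all assumes the NVIDIA container toolkit is installed on the host):

# example tag; pick a current *-tf1-py3 tag from the NGC catalog
docker pull nvcr.io/nvidia/tensorflow:21.12-tf1-py3
# mount your working directory and start an interactive shell with all GPUs visible
docker run --gpus all -it --rm -v "$PWD":/workspace/deepspeech nvcr.io/nvidia/tensorflow:21.12-tf1-py3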
Hi Carl, can you please share which CUDA & cuDNN versions you are using? And also, what is your TensorFlow version? We are planning to use an NVIDIA A40; your info would greatly help us!
Carl and All,
I take the opportunity to update this thread with our experience: we have a server equipped with two NVIDIA T4s, Driver Version 460.91.03 and CUDA Version 11.2.
For months we were able to run training and inference without problems in a container based on the image tensorflow/tensorflow:1.15.4-gpu-py3.
Santa Claus just delivered two A100s, which we installed in our server yesterday. But we aren’t able to use the GPUs anymore: the training script’s PID is correctly referenced in nvidia-smi and data are loaded into GPU RAM, but the GPUs are not working, only the CPU.
xxxxxxx@xxxxx:~$ nvidia-smi
Fri Jan 7 15:51:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:27:00.0 Off | 0 |
| N/A 34C P0 35W / 250W | 38888MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-PCIE-40GB On | 00000000:A3:00.0 Off | 0 |
| N/A 32C P0 35W / 250W | 418MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 35281 C python3 38885MiB |
| 1 N/A N/A 35281 C python3 415MiB |
+-----------------------------------------------------------------------------+
Has anyone faced this issue? Was the T4 / CUDA 11 / TF 1.15 setup only a mirage, which disappeared when the T4s were replaced by A100s?
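For completeness, a quick sanity check from inside the container (a generic TF 1.x one-liner, nothing DeepSpeech-specific) would be something like:

# prints True plus a device name if TF 1.x can actually use a GPU
python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available()); print(tf.test.gpu_device_name())"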
Thanks for your support.
Fabien.
Hi, @crayabox!
Did you manage to solve this issue?
I’m in a very similar situation. I have a Tesla T4, driver version 495.29.05, CUDA version 11.5, but I can’t get it to run properly on the GPU: it allocates the process, but only the CPU is being used.
Ciao @Antonio_Alves ,
Yes, we partially managed to solve this issue by using an image from NVIDIA NGC as the base for our container.
We did successfully train a model with our A100s based on nvcr.io/nvidia/tensorflow:21.12-tf1-py3 (tf 1.15.5), BUT since yesterday we have been investigating an error during the export of the TFLite model, with the message AttributeError: module 'tensorflow' has no attribute 'lite'.
We have the very same issue when trying the image tensorflow:20.11-tf1-py3, which embeds tf 1.15.4.
To be continued.
Fabien.
EDIT: export works in .pb protocol buffer format
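For reference, the export that does work is the plain protocol-buffer one, roughly like this (standard DeepSpeech training flags, paths are placeholders):

# .pb export only -- paths are placeholders
python3 DeepSpeech.py --checkpoint_dir /path/to/checkpoints --export_dir /path/to/export
# adding --export_tflite is the step that currently fails with "no attribute 'lite'"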
Hi @crayabox
Can you share how exactly you used nvcr.io/nvidia/tensorflow:21.12-tf1-py3 (tf 1.15.5) to train with your GPU?
I’ve pulled this one, but now I’m facing a problem using it:
the Python version when I run the image is 3.8, and I think this causes the problem.
When I try to install requirements.txt I get the error below:
ERROR: Could not find a version that satisfies the requirement tensorflow==1.15.4 (from deepspeech-training) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0rc0, 2.4.0rc1, 2.4.0rc2, 2.4.0rc3, 2.4.0rc4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.5.0rc0, 2.5.0rc1, 2.5.0rc2, 2.5.0rc3, 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0rc0, 2.6.0rc1, 2.6.0rc2, 2.6.0, 2.6.1, 2.6.2, 2.6.3, 2.7.0rc0, 2.7.0rc1, 2.7.0, 2.7.1, 2.8.0rc0, 2.8.0rc1, 2.8.0)
ERROR: No matching distribution found for tensorflow==1.15.4
thanks
I guess I understand your issue: TensorFlow is already embedded in the NVIDIA image, so there is no need to reinstall tf 1.15.4.
On our side, I only install tensorboard 1.15.0 and tensorflow-estimator 1.15.1 on top of the docker image.
Try removing the tensorflow==1.15.4 requirement from your installation; it should then work on your side as it does on ours.
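As a rough sketch (not necessarily our exact commands), inside the NGC image that could look like:

# inside nvcr.io/nvidia/tensorflow:21.12-tf1-py3 -- rough sketch only
python3 -c "import tensorflow as tf; print(tf.__version__)"   # the image already ships TF 1.15.5
pip install tensorboard==1.15.0 tensorflow-estimator==1.15.1
# from the DeepSpeech source checkout: install the training code without letting pip
# pull tensorflow==1.15.4 back in (then add any remaining requirements by hand)
pip install --no-deps -e .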
Fabien.