Hi,
Is there anyone who managed to run Deepspeech Model training on NVIDIA A100 GPUs?
We got a server with 8 NVIDIA A100 40GB CoWoS HBM2 PCIe 4.0-- Passive Cooling, but couldn’t manage to run training with GPU yet.
Hi Betim,
I tried training it on an A100 some time back, but the A100 does not support CUDA 10.0, I guess (I could be wrong, it’s been a while since I last checked :p). If you can get CUDA 10.0 working on it, then it should work.
Maybe NVIDIA provides a TensorFlow r1.15 build for those GPUs?
So we have to uninstall the DS TensorFlow and install NVIDIA’s r1.15?
There is no such thing as DeepSpeech TensorFlow. We rely on upstream TensorFlow. I read somewhere that NVIDIA is providing a TensorFlow r1.15 package for RTX 3xxx, so maybe this could apply to your case as well.
I see that the description matches what we’re talking about at this GitHub link: https://github.com/NVIDIA/tensorflow.
@lissyx can you just check it and give me a hint whether we’re pointing in the same direction?
I’m sorry but if you don’t ask me a clear question, I don’t have time to dig into nvidia’s repos, and I can’t speak for them nor provide support for their work.
@betim And the start of their README is pretty clear to me; it seems to be exactly this use case that is addressed.
@betim I confirm I can train a model on a 3000-series card with NVIDIA’s version; I guess it also works with the A100 since they are Ampere-based GPUs.
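If it helps, installing NVIDIA’s TF1 build from the repo linked above goes through their pip index, roughly like this (double-check their README for the exact Python and OS requirements):

# package names are from the github.com/NVIDIA/tensorflow README; versions resolve via NVIDIA's index
pip install --user nvidia-pyindex
pip install --user nvidia-tensorflow[horovod]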
NVIDIA also offers Docker containers with their TensorFlow build:
https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
Maybe you want to give that a try instead of building TensorFlow on your own.
I also cannot test A100 GPUs at the moment.
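For example, pulling and starting one of their TF1 images could look roughly like this (the tag is only an example from the catalog, and --gpus all assumes the NVIDIA container toolkit is installed on the host):

# example tag; pick a current *-tf1-py3 tag from the NGC catalog
docker pull nvcr.io/nvidia/tensorflow:21.12-tf1-py3
# mount your working directory and start an interactive shell with all GPUs visible
docker run --gpus all -it --rm -v "$PWD":/workspace/deepspeech nvcr.io/nvidia/tensorflow:21.12-tf1-py3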
Hi Carl, can you please share which CUDA & cuDNN versions you are using? And also, what is your TensorFlow version? We are planning to use an NVIDIA A40; your info would greatly help us!
Carl and All,
I take the opportunity to update this thread with our experience: we have a server equipped with two NVIDIA T4s, Driver Version 460.91.03 and CUDA Version 11.2.
For months we were able to run training and inference without problems in a container based on the image tensorflow/tensorflow:1.15.4-gpu-py3.
Santa Claus just delivered two A100s, which we installed in our server yesterday. But we aren’t able to use the GPUs anymore: the training script’s PID is correctly referenced in nvidia-smi and data are loaded into GPU RAM, but the GPUs are not working, only the CPU.
xxxxxxx@xxxxx:~$ nvidia-smi
Fri Jan 7 15:51:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:27:00.0 Off | 0 |
| N/A 34C P0 35W / 250W | 38888MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-PCIE-40GB On | 00000000:A3:00.0 Off | 0 |
| N/A 32C P0 35W / 250W | 418MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 35281 C python3 38885MiB |
| 1 N/A N/A 35281 C python3 415MiB |
+-----------------------------------------------------------------------------+
Has anyone faced this issue? Was the T4 / CUDA 11 / TF 1.15 setup only a mirage, which disappeared when the T4s were replaced by A100s?
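For completeness, a quick sanity check from inside the container (a generic TF 1.x one-liner, nothing DeepSpeech-specific) would be something like:

# prints True plus a device name if TF 1.x can actually use a GPU
python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available()); print(tf.test.gpu_device_name())"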
Thanks for your support.
Fabien.
Hi, @crayabox!
Did you manage to solve this issue?
I’m in a very similar situation. I have a Tesla T4, driver version 495.29.05, CUDA version 11.5, but I can’t get it to run properly on the GPU: it allocates the process, but only the CPU is being used.
Ciao @Antonio_Alves ,
Yes, we partially managed to solve this issue by using an image from NVIDIA NGC as the base for our container.
We did successfully train a model with our A100s based on nvcr.io/nvidia/tensorflow:21.12-tf1-py3 (tf 1.15.5), BUT since yesterday we have been investigating an error during the export of the TFLite model, with the message AttributeError: module 'tensorflow' has no attribute 'lite'.
We have the very same issue when trying the image tensorflow:20.11-tf1-py3, which embeds tf 1.15.4.
To be continued.
Fabien.
EDIT: export works in .pb protocol buffer format
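For reference, the export that does work is the plain protocol-buffer one, roughly like this (standard DeepSpeech training flags, paths are placeholders):

# .pb export only -- paths are placeholders
python3 DeepSpeech.py --checkpoint_dir /path/to/checkpoints --export_dir /path/to/export
# adding --export_tflite is the step that currently fails with "no attribute 'lite'"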
Hi @crayabox
Can you share how exactly you used nvcr.io/nvidia/tensorflow:21.12-tf1-py3 (tf 1.15.5) to train with your GPU?
I’ve pulled this one, but now I’m facing a problem using it:
the Python version when I run the image is 3.8, and I think this causes the problem.
When I try to install requirements.txt I get the error below:
ERROR: Could not find a version that satisfies the requirement tensorflow==1.15.4 (from deepspeech-training) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0rc0, 2.4.0rc1, 2.4.0rc2, 2.4.0rc3, 2.4.0rc4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.5.0rc0, 2.5.0rc1, 2.5.0rc2, 2.5.0rc3, 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0rc0, 2.6.0rc1, 2.6.0rc2, 2.6.0, 2.6.1, 2.6.2, 2.6.3, 2.7.0rc0, 2.7.0rc1, 2.7.0, 2.7.1, 2.8.0rc0, 2.8.0rc1, 2.8.0)
ERROR: No matching distribution found for tensorflow==1.15.4
thanks
I guess I understand your issue: TensorFlow is already embedded in the NVIDIA image, so there is no need to reinstall tf 1.15.4.
On our side, I only install tensorboard 1.15.0 and tensorflow-estimator 1.15.1 on top of the docker image.
Try removing the tensorflow==1.15.4 requirement from your installation; it should then work on your side as it does on ours.
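As a rough sketch (not necessarily our exact commands), inside the NGC image that could look like:

# inside nvcr.io/nvidia/tensorflow:21.12-tf1-py3 -- rough sketch only
python3 -c "import tensorflow as tf; print(tf.__version__)"   # the image already ships TF 1.15.5
pip install tensorboard==1.15.0 tensorflow-estimator==1.15.1
# from the DeepSpeech source checkout: install the training code without letting pip
# pull tensorflow==1.15.4 back in (then add any remaining requirements by hand)
pip install --no-deps -e .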
Fabien.