NVIDIA A100: NaN loss when training on bare metal

When training on bare metal with NVIDIA A100 GPUs, the loss is NaN. I am able to train on a MIG instance, but in that case only one GPU is used; MIG instances cannot be created by combining multiple physical GPUs.

Initially I thought this was a matter of parameter tuning, but since training works on a MIG instance, I now think it is a different issue.

I am using NVIDIA TensorFlow with Horovod.

Additionally, when running nvidia-smi, GPU utilization is zero; the CPU is not being consumed either.

No idea what that means

Not something we have support for, unfortunately

It’s going to be hard to get a proper understanding of what you are trying to do if you don’t share a bit more about your setup, like the dataset you are working on, its volume, training parameters, …

MIG - https://www.nvidia.com/en-us/technologies/multi-instance-gpu/

The dataset is 1000 hours, but I don’t think the dataset is the issue, since it works in a MIG instance; the training batch size is 64 and the learning rate is 0.0001.

1000 hours is not much material for an A100, so epochs should take minutes. GPU utilization should be 30-50% most of the time, certainly not 0%.

Set dropout to 0.3 or so and check the alphabet.
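A quick way to "check the alphabet" is to verify that every character in the training transcripts appears in alphabet.txt, since a character missing from the alphabet is a classic cause of NaN loss. This is only a sketch: the paths and the DeepSpeech-style CSV layout (a `transcript` column) are assumptions about the setup.

```python
# Sketch: report transcript characters missing from the alphabet file.
# Assumes one character per line in alphabet.txt ('#' lines treated as
# comments) and a CSV with a 'transcript' column -- adjust as needed.
import csv

def missing_characters(alphabet_path, csv_path):
    with open(alphabet_path, encoding="utf-8") as f:
        alphabet = {line.rstrip("\n") for line in f if not line.startswith("#")}
    missing = set()
    with open(csv_path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            missing |= set(row["transcript"]) - alphabet
    return missing

# e.g. print(missing_characters("data/alphabet.txt", "data/train/train.csv"))
```

If this returns a non-empty set, clean the transcripts or extend the alphabet before retraining.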

I don’t think it’s a matter of dataset, volume, or training parameters, because under the same conditions training goes very well and very fast on just one MIG instance, made from a single A100 GPU, and is far faster than training on bare metal with 8 A100 GPUs, where we get the negative or NaN loss issue.

Sorry, you’ll have to give some more information or we are just guessing on this end. Why didn’t you write that everything works great on the other server (which GPUs?) and not on the A100? Have you checked the other threads about the A100? What did you do differently?

We did the training on the same server: when we use the GPUs on bare metal we have those issues, but it works well if we run it in MIG ( https://www.nvidia.com/en-us/technologies/multi-instance-gpu/ ). However, using MIG we are not able to use more than one GPU in parallel, so we want to train on bare metal …

Negative or NaN losses are already widely documented as a sign of training hyperparameters that don’t fit the dataset.

Now, you are using stuff we don’t have access to, and we can’t provide support:

  • horovod,
  • A100 GPUs,
  • MIG partitioning

And there are already other people hacking on A100 and/or Horovod who haven’t reported any issue: Training model with NVIDIA A100

We started that discussion too, and after taking some ideas from it we managed to run training successfully, but only in MIG. We are thankful for your support.

If we manage to run training successfully on bare metal, we will share our experience.

Here is another link about MIG which someone may find helpful:

All in all, without knowing more about your hardware as well as the characteristics of your dataset, NaN loss, low GPU usage and no error would be 100% consistent with a learning rate that is too aggressive for too small a batch size.

As @lissyx and @othiele already mentioned, you have to share precise details about your setup and what you did. I can’t even work out what you are trying to do…
It does not help to reference MIG for the 10th time if you do not share how you set it up and how many physical GPUs are involved.

MIG is a feature introduced with the A100 series which splits a physical GPU into up to 7 independent GPU instances.
I have not used it yet, but as I understand it, the instances share the same address space, like multiple physical GPUs in one machine.
Therefore, no distributed setup should be necessary.

Neither MIG nor NVIDIA TensorFlow (it’s just normal TensorFlow 1.15 built by NVIDIA for newer devices because of Google’s support deadline) depends on Horovod, which you referenced in your first post.
But feel free to use our PR and maybe achieve better performance.

One more point: since MIG only splits up your physical device, it is meant for parallel tasks which do not saturate a full A100. So there should be no benefit in splitting the same training run with MIG.

If you have performance metrics for a single A100 and for some MIG instances, please share them with us.

Additionally, when running nvidia-smi, GPU utilization is zero; the CPU is not being consumed either.

This is interesting. If nvidia-smi is showing 0% GPU usage, is the GPU actually being used? An alternative explanation could be that nvidia-smi does not support the MIG architecture, but that does not explain the CPU not being consumed.

Reading through the MIG documentation, MIG mode is not configured by default and has to be specifically enabled. This could explain why nvidia-smi is not showing any GPU usage.


Hi guys,

Thank you all for your help so far. These are the parameters we are using for the training (Supermicro, Ubuntu 18.04, 8× A100 40GB GPUs, 128 CPUs, 126GB RAM):

python DeepSpeech.py --epochs=100 --early_stop=False \
--drop_source_layers 1 \
--train_files data/train/train.csv \
--dev_files data/train/dev.csv \
--test_files data/train/test.csv \
--export_dir data/export/ \
--dropout_rate 0.4 \
--save_checkpoint_dir data/result/checkout/ \
--load_checkpoint_dir data/result/checkout/ \
--alphabet_config_path data/alphabet.txt \
--summary_dir data/summary \
--learning_rate=0.0001 \

The loss becomes NaN after some steps, but with the same parameters we are able to train in MIG.

Again, you do not explain what “train in MIG” means in your case. How is it configured?

There is a \ at the end. Is something missing?

I guess training runs on 8 GPUs? Have you tried using only one with CUDA_VISIBLE_DEVICES?
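For reference, a minimal sketch of pinning the process to a single GPU. The key point is that CUDA_VISIBLE_DEVICES has to be set before TensorFlow (or any CUDA library) initializes; the GPU index here is just an example.

```python
# Sketch: make only physical GPU 0 visible to the process.
# Must run before TensorFlow is imported, because CUDA device
# enumeration happens at initialization time.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# import tensorflow as tf  # import TF only after the variable is set
```

Equivalently, set it on the command line: `CUDA_VISIBLE_DEVICES=0 python DeepSpeech.py …`.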

I think this will overfit. Try something around 20.

With 8 GPUs and batch size 64 you should have an effective batch size of 512. Maybe increase the learning rate? We do so when we use Horovod.
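The arithmetic behind that suggestion is the linear-scaling rule commonly used with Horovod-style data parallelism, shown here with the numbers from this thread; whether linear scaling is the right rule for this particular model is an assumption.

```python
# Sketch of the linear-scaling rule: with N workers each processing
# batch_size samples per step, the effective batch is N * batch_size,
# and the base learning rate is commonly multiplied by N.
num_gpus = 8
per_gpu_batch_size = 64
base_learning_rate = 0.0001

effective_batch_size = num_gpus * per_gpu_batch_size
scaled_learning_rate = base_learning_rate * num_gpus

print(effective_batch_size)   # 512
print(scaled_learning_rate)   # 0.0008 (up to float rounding)
```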


If you have 40GB of RAM on each GPU, a batch size of 64 can be ridiculously small. But again, it depends on what you have in your data …

Yes, but we have experimented with batch sizes from 1 to 2048 (1, 2, 4, 8, 16, 32, 64, …) and it fails in every single case: we get either OOM or negative/NaN loss.

The majority of the data is between 3 and 15 seconds, though some is 15-30 seconds; but with the same data we managed to train on another machine and got a validation loss of 12%.

We created a MIG instance with sudo nvidia-smi mig -i 0 -cgi 0 -C and it worked with batch size 64 and learning rate 0.0001, but it failed with batch sizes 96, 128, 256, …

Yes, today we managed to train with the same parameters I shared earlier, just with batch_size 32 (because with 64 we had OOM) and a learning_rate of 0.001, but the loss remains high even after 7 epochs (| Loss: 80.251452 |).

No, it was a typo.

OK, we tried with 0.0001; we will try 0.0005 and 0.001.

So you go from NaN loss directly to OOM? With what values?

Try to stick to the 3-15 second interval. On data ranging from 3 to 10 seconds, I can use a batch size of 64 on an RTX 2080 Ti with 11GB without any problem …
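To prune the outliers, here is a rough sketch that filters a DeepSpeech-style CSV (wav_filename, wav_filesize, transcript) by estimated duration. The 16 kHz / 16-bit / mono assumption, so duration ≈ (wav_filesize − 44) / 32000, is mine and must be adjusted to the actual audio format.

```python
# Sketch: keep only rows whose estimated clip duration falls in
# [min_s, max_s]. Duration is estimated from the WAV file size,
# assuming 16 kHz, 16-bit, mono audio with a 44-byte header.
import csv

BYTES_PER_SECOND = 16000 * 2  # sample rate * bytes per sample (mono)
WAV_HEADER_BYTES = 44

def filter_by_duration(in_csv, out_csv, min_s=3.0, max_s=15.0):
    with open(in_csv, encoding="utf-8") as fin, \
         open(out_csv, "w", encoding="utf-8", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        kept = 0
        for row in reader:
            duration = (int(row["wav_filesize"]) - WAV_HEADER_BYTES) / BYTES_PER_SECOND
            if min_s <= duration <= max_s:
                writer.writerow(row)
                kept += 1
        return kept
```

For exact durations it would be safer to read each WAV header (e.g. with the `wave` module) instead of relying on file size.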

Also, how many epochs? Sharing full logs from the beginning would have been helpful …

I have been having the same problem with old TensorFlow code.
I can run my old TensorFlow code (tf 1.12 and tf 1.15) in a conda environment on a DGX system with 8 GPUs (32GB), but when I try the same thing on an A100 system, I get NaN in the loss.