When I train on bare-metal NVIDIA A100 GPUs, the loss is NaN. Training works on a MIG instance, but in that case it uses only one GPU, because a MIG instance cannot be created by combining multiple physical GPUs.
Initially I assumed this was a matter of parameter tuning, but since the same training works on a MIG instance, I now suspect the cause is something else.
I am using NVIDIA TensorFlow with Horovod.
Additionally, when I run nvidia-smi, GPU utilization shows 0%, and the CPU is not being used either.
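For anyone who wants to reproduce the utilization reading, here is a minimal sketch of how it could be sampled from a script while training runs. It relies on the `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` query mode; the helper names (`parse_gpu_csv`, `query_gpus`, `idle_gpus`) are hypothetical and made up for illustration, not part of any library.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "index,utilization.gpu,memory.used"


def _to_int(value):
    """Return value as an int, or None when it is not numeric (e.g. "[N/A]")."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'index, utilization, memory' CSV rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip anything that is not a three-field data row
        index, util, mem = parts
        rows.append({
            "index": int(index),
            "util_percent": _to_int(util),
            "memory_used_mib": _to_int(mem),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    result = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def idle_gpus(rows):
    """Return the indices of GPUs that report exactly 0% utilization."""
    return [r["index"] for r in rows if r["util_percent"] == 0]
```

Calling `idle_gpus(query_gpus())` every few seconds during a run would show whether any GPU ever leaves 0%. A reading that cannot be parsed as a number is kept as `None` rather than counted as idle.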