Error while training Tacotron with multiple GPUs

argyadiva · April 2, 2021, 12:47pm

I’m trying to train a TTS model on the LJSpeech dataset on a multi-GPU server. At the moment I’m using 2 GPUs, but training the Tacotron model fails with the error below.
(I’m using CUDA 11.0.)

    ['/TTS/TTS/bin/train_tacotron.py', '--continue_path=', '--restore_path=', '--config_path=TTS/tts/configs/ljspeech_tacotron2_dynamic_conv_attn.json', '--group_id=group_2021_04_02-120236', '--rank=0']
    ['/TTS/TTS/bin/train_tacotron.py', '--continue_path=', '--restore_path=', '--config_path=TTS/tts/configs/ljspeech_tacotron2_dynamic_conv_attn.json', '--group_id=group_2021_04_02-120236', '--rank=1']
    2021-04-02 12:02:37.628487: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
    2021-04-02 12:02:37.628523: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    2021-04-02 12:02:37.705717: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
    2021-04-02 12:02:37.705747: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
     > Using CUDA:  True
     > Number of GPUs:  2
       >  Mixed precision mode is ON
     > Git Hash: e9e0784
     > Experiment folder: /home/erogol/Models/LJSpeech/ljspeech-dcattn-April-02-2021_12+02PM-e9e0784
     > Setting up Audio Processor...
     | > sample_rate:48000
     | > resample:False
     | > num_mels:80
     | > min_level_db:-100
     | > frame_shift_ms:None
     | > frame_length_ms:None
     | > ref_level_db:20
     | > fft_size:1024
     | > power:1.5
     | > preemphasis:0.0
     | > griffin_lim_iters:60
     | > signal_norm:True
     | > symmetric_norm:True
     | > mel_fmin:50.0
     | > mel_fmax:7600.0
     | > spec_gain:1.0
     | > stft_pad_mode:reflect
     | > max_norm:4.0
     | > clip_norm:True
     | > do_trim_silence:True
     | > trim_db:60
     | > do_sound_norm:False
     | > stats_path:LJSpeech-1.1/scale_stats.npy
     | > hop_length:256
     | > win_length:1024
     | > Found 13100 files in /TTS/LJSpeech-1.1
     > Using model: Tacotron2
     ! Run is removed from /home/erogol/Models/LJSpeech/ljspeech-dcattn-April-02-2021_12+02PM-e9e0784
    Traceback (most recent call last):
      File "/TTS/TTS/bin/train_tacotron.py", line 721, in <module>
        main(args)
      File "/TTS/TTS/bin/train_tacotron.py", line 575, in main
        model = apply_gradient_allreduce(model)
      File "/venvtts/lib/python3.6/site-packages/TTS-0.0.9.2-py3.6-linux-x86_64.egg/TTS/utils/distribute.py", line 81, in apply_gradient_allreduce
        dist.broadcast(p, 0)
      File "/venvtts/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1039, in broadcast
        work = default_pg.broadcast([tensor], opts)
    RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
    ncclUnhandledCudaError: Call to CUDA function failed.

Any help is appreciated. Thank you in advance!