I’m trying to train a TTS model on the LJSpeech dataset on a multi-GPU server. At the moment I’m using 2 GPUs, but I get an error when training the Tacotron2 model.
(I’m using CUDA 11.0.) Here is the full output from both ranks:
['/TTS/TTS/bin/train_tacotron.py', '--continue_path=', '--restore_path=', '--config_path=TTS/tts/configs/ljspeech_tacotron2_dynamic_conv_attn.json', '--group_id=group_2021_04_02-120236', '--rank=0']
['/TTS/TTS/bin/train_tacotron.py', '--continue_path=', '--restore_path=', '--config_path=TTS/tts/configs/ljspeech_tacotron2_dynamic_conv_attn.json', '--group_id=group_2021_04_02-120236', '--rank=1']
2021-04-02 12:02:37.628487: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-04-02 12:02:37.628523: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-04-02 12:02:37.705717: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-04-02 12:02:37.705747: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
> Using CUDA: True
> Number of GPUs: 2
> Mixed precision mode is ON
> Git Hash: e9e0784
> Experiment folder: /home/erogol/Models/LJSpeech/ljspeech-dcattn-April-02-2021_12+02PM-e9e0784
> Setting up Audio Processor...
| > sample_rate:48000
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:LJSpeech-1.1/scale_stats.npy
| > hop_length:256
| > win_length:1024
| > Found 13100 files in /TTS/LJSpeech-1.1
> Using model: Tacotron2
! Run is removed from /home/erogol/Models/LJSpeech/ljspeech-dcattn-April-02-2021_12+02PM-e9e0784
Traceback (most recent call last):
  File "/TTS/TTS/bin/train_tacotron.py", line 721, in <module>
    main(args)
  File "/TTS/TTS/bin/train_tacotron.py", line 575, in main
    model = apply_gradient_allreduce(model)
  File "/venvtts/lib/python3.6/site-packages/TTS-0.0.9.2-py3.6-linux-x86_64.egg/TTS/utils/distribute.py", line 81, in apply_gradient_allreduce
    dist.broadcast(p, 0)
  File "/venvtts/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1039, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
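
The NCCL message above does not say which CUDA call failed. One way to get more detail might be to repeat the run with NCCL's debug logging turned on through the `NCCL_DEBUG` environment variable. The following is only a minimal sketch under that assumption: the helper names `build_debug_env` and `rerun_with_nccl_debug` are hypothetical, and the command to launch is whatever was used for the failing run (it is not shown in the log, so it is passed in as arguments rather than guessed).

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    Hypothetical helper: it only sets NCCL_DEBUG=INFO, which asks NCCL
    to print extra diagnostics when a collective such as broadcast fails.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    return env


def rerun_with_nccl_debug(command):
    """Run the given training command with the debug environment.

    `command` is a list such as ["python", "<launcher script>", ...];
    the actual launcher and its flags are an assumption left to the caller.
    """
    completed = subprocess.run(command, env=build_debug_env())
    return completed.returncode


if __name__ == "__main__":
    # Everything after the script name is treated as the command to rerun.
    if len(sys.argv) > 1:
        sys.exit(rerun_with_nccl_debug(sys.argv[1:]))
```

Saved as, say, `nccl_debug.py` (a hypothetical file name), it would be invoked with the original training command appended, and the extra NCCL lines in the output could then be added to this post.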
Any help is appreciated. Thank you in advance!