Distributed training with train_vocoder_gan.py error

Hello everyone,

So far, I’ve successfully trained a model with Tacotron 2, and the speech synthesized with Universal FullBand-MelGAN sounds okay. To further improve the quality and make the synthesized voice sound more similar to the original speaker, I decided to train my own vocoder on the same dataset.
But when I use the following command:
CUDA_VISIBLE_DEVICES="0,1,2" OMP_NUM_THREADS=1 python TTS/bin/distribute.py --script train_vocoder_gan.py --config_path config_vocoder_PWgan.json

I get the following error:

Traceback (most recent call last):
  File "/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py", line 654, in <module>
    main(args)
  File "/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py", line 559, in main
    epoch)
  File "/home/ldai/projects/TTS/TTS/bin/train_vocoder_gan.py", line 114, in train
    y_hat = model_G(c_G)
  File "/home/ldai/anaconda3/envs/mozillatts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ldai/anaconda3/envs/mozillatts/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

I didn’t get such an error when using train_tacotron.py with the same command.
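From the error message, it looks like the generator (or discriminator) has parameters that don’t contribute to the loss in every iteration, which DistributedDataParallel doesn’t allow by default. I’m guessing the fix is something like the sketch below, where the DDP wrapping in the training script would pass find_unused_parameters=True (the variable names here are just placeholders, not the actual code in train_vocoder_gan.py):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch only: wrap the generator so DDP tolerates parameters that do not
# receive gradients in a given iteration (e.g. when only part of the GAN
# loss is computed in that step).
model_G = DDP(
    model_G,
    device_ids=[local_rank],        # placeholder for the GPU assigned to this process
    find_unused_parameters=True,    # suggested by the error message above
)

(I understand find_unused_parameters=True adds some overhead, since DDP then has to check for unused parameters on every iteration.)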
Any suggestions?

Library versions:
python=3.6.12
torch=1.7.1