I am trying to train a vocoder on an RTX 3090 with the torch 1.8.0 nightly (because 1.7.0 doesn't support the 3090) and I get the error below. Does anyone know where the problem might be?
```
TRAINING (2020-11-30 16:55:21)
 ! Run is removed from /home/abai/Downloads/kazakh_synthes/Results/multiband-melgan-November-30-2020_04+55PM-f6c96b0
Traceback (most recent call last):
  File "bin/train_vocoder.py", line 647, in <module>
    main(args)
  File "bin/train_vocoder.py", line 551, in main
    epoch)
  File "bin/train_vocoder.py", line 149, in train
    feats_real, y_hat_sub, y_G_sub)
  File "/home/abai/MozillaTTS/lib/python3.7/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/abai/TTS/TTS/vocoder/layers/losses.py", line 234, in forward
    stft_loss_mg, stft_loss_sc = self.stft_loss(y_hat.squeeze(1), y.squeeze(1))
  File "/home/abai/MozillaTTS/lib/python3.7/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/abai/TTS/TTS/vocoder/layers/losses.py", line 71, in forward
    lm, lsc = f(y_hat, y)
  File "/home/abai/MozillaTTS/lib/python3.7/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/abai/TTS/TTS/vocoder/layers/losses.py", line 47, in forward
    y_hat_M = self.stft(y_hat)
  File "/home/abai/TTS/TTS/vocoder/layers/losses.py", line 26, in __call__
    return_complex=True)
  File "/home/abai/MozillaTTS/lib/python3.7/site-packages/torch/functional.py", line 516, in stft
    normalized, onesided, return_complex)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
Opening this topic again, as the problem is still there. I tried training on a 2080 and a 3090, and neither of them works with torch 1.6 (since torch 1.6 doesn't support CUDA 11).
So this issue only happens on the torch 1.7 and 1.8 nightly builds. The solution is to initialize the parameters on the GPU when the matrices/tensors are created; I will search the docs and write the results here.
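To make that concrete, here is a rough sketch of what I mean (not the exact TTS code): create the window tensor on the GPU at creation time instead of leaving it on the CPU.

```python
import torch

win_length = 1024

# Default: the window is created on the CPU, which later clashes with CUDA inputs
window_cpu = torch.hann_window(win_length)

# Create it on the GPU instead (or move it right after creation)
window_gpu = torch.hann_window(win_length, device="cuda")

print(window_cpu.device)  # cpu
print(window_gpu.device)  # cuda:0
```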
@Dias How did you manage to get past this problem? I am having the exact same issue with my RTX 3060. When I try to start training the vocoder model it says "expected all tensors to be on the same device, but found at least two devices".
I tried downgrading torch and also tried the nightly version, but the issue remains. Can you please let me know how you fixed it? Thanks.
As far as I remember, I searched the docs: some tensors in PyTorch were being loaded on the GPU and some on the CPU, and that was the conflict. I don't know the current situation on the latest versions, as I was working with this half a year ago.
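A minimal, made-up example of that conflict (not the TTS code itself): torch.stft fails when the signal is on the GPU but the window is still on the CPU.

```python
import torch

signal = torch.randn(16000, device="cuda")  # training audio lives on cuda:0
window = torch.hann_window(1024)            # created on the CPU by default

# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:0 and cpu!
spec = torch.stft(signal, n_fft=1024, hop_length=256, win_length=1024,
                  window=window, return_complex=True)
```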
You need to check from your traceback which file the issue comes from, and then move the tensors to the same device, either the CPU (.cpu()) or the GPU (.cuda()).
For me it was in /TTS/vocoder/layers/losses.py and I made the following change:
line 21: self.window = self.window.cuda()
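For context, here is a rough paraphrase of what that part of losses.py looks like after the change (class names and line numbers may differ between TTS versions). The important part is assigning the result back, since .cuda() returns a copy rather than moving the tensor in place.

```python
import torch


# Rough paraphrase of the STFT wrapper in TTS/vocoder/layers/losses.py
# after the change (names and line numbers may differ in your version).
class TorchSTFT:
    def __init__(self, n_fft=1024, hop_length=256, win_length=1024,
                 window="hann_window"):
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
        self.window = getattr(torch, window)(win_length)
        # the added line: move the window to the GPU so it matches the
        # CUDA audio tensors seen during training
        self.window = self.window.cuda()

    def __call__(self, x):
        return torch.stft(x, self.n_fft, hop_length=self.hop_length,
                          win_length=self.win_length, window=self.window,
                          return_complex=True)
```

Note that hard-coding .cuda() like this assumes you always train on a GPU; moving the window to the input's device inside __call__ (e.g. self.window.to(x.device)) would be more general.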
P.S. You will need to reinstall the package if you change the files in the repository. Alternatively, you can edit the files directly in your TTS installation under your environment's site-packages.