Hey all, I’m not sure what’s going wrong here. The first epoch runs for about 2 seconds before failing with the error below. It seems to happen right at the initialization phase, but since I’m just getting into things I don’t want to immediately jump into the code and start breaking it lol. I thought I’d first ask whether anyone with more CUDA experience than I have has any suggestions.
! Run is removed from /home/REDACTED/tts/mTTS_IO/REDACTED/output/ljspeech-ddc-April-09-2021_09+52AM-0000000
Traceback (most recent call last):
  File "/home/REDACTED/aur/TTS/TTS/bin/train_tacotron.py", line 721, in <module>
    main(args)
  File "/home/REDACTED/aur/TTS/TTS/bin/train_tacotron.py", line 623, in main
    scaler_st)
  File "/home/REDACTED/aur/TTS/TTS/bin/train_tacotron.py", line 193, in train
    scaler.scale(loss_dict['loss']).backward()
  File "/home/REDACTED/anaconda3/envs/mTTS/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/REDACTED/anaconda3/envs/mTTS/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
This is the first time I’ve used any of this, so I honestly don’t know how to proceed; any and all suggestions or advice would be most appreciated. I’m running Manjaro Linux, kernel 5.11, with an NVIDIA GeForce GTX 1650 Mobile / Max-Q. I ran into memory errors at first, so all I’ve changed so far is setting gradual training to null and reducing the batch size from 32/16 to 16/8. Thank you!
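In case it helps narrow things down, here’s a tiny standalone script I was planning to try, just to see whether a plain half-precision matmul works at all on this card outside of TTS. I’m assuming the cublasGemmEx call in the traceback corresponds to an fp16 GEMM from the mixed-precision backward pass, so the shapes below are arbitrary; it just exercises what I think is the same path:

import torch

# sanity check: fp16 matmul forward + backward on the GPU,
# roughly the same cuBLAS half-precision GEMM path the traceback shows
assert torch.cuda.is_available()
a = torch.randn(256, 256, device="cuda", dtype=torch.float16, requires_grad=True)
b = torch.randn(256, 256, device="cuda", dtype=torch.float16)

c = a @ b            # forward fp16 matmul (cublasGemmEx under the hood)
c.sum().backward()   # backward pass, analogous to scaler.scale(loss).backward()
torch.cuda.synchronize()
print("fp16 matmul OK, grad norm:", a.grad.float().norm().item())

If that fails too, I guess the problem is with my CUDA/driver setup rather than the training code; if it passes, I’m not sure where to look next.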
Edit: typo