CUDA cublasGemmEx failing at the beginning of the first epoch?

Hey all, I’m not sure what’s going wrong here. The first epoch runs for about 2 seconds before it fails with the error below. It seems to happen right at the initialization phase, but I’m just getting into things here and I don’t want to immediately jump into the code and start breaking it, lol. I thought I’d ask whether anyone with more experience with CUDA et al. has any suggestions first.

 ! Run is removed from /home/REDACTED/tts/mTTS_IO/REDACTED/output/ljspeech-ddc-April-09-2021_09+52AM-0000000
Traceback (most recent call last):
  File "/home/REDACTED/aur/TTS/TTS/bin/train_tacotron.py", line 721, in <module>
    main(args)
  File "/home/REDACTED/aur/TTS/TTS/bin/train_tacotron.py", line 623, in main
    scaler_st)
  File "/home/REDACTED/aur/TTS/TTS/bin/train_tacotron.py", line 193, in train
    scaler.scale(loss_dict['loss']).backward()
  File "/home/REDACTED/anaconda3/envs/mTTS/lib/python3.6/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/REDACTED/anaconda3/envs/mTTS/lib/python3.6/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`

It’s the first time I’ve used any of this, so I honestly don’t really know how to proceed. Any and all suggestions and advice would be most appreciated. I’m running Manjaro Linux, kernel 5.11, with an NVIDIA GeForce GTX 1650 Mobile / Max-Q. I ran into memory errors at first, so all I’ve done is set gradual training to null and reduce the batch size from 32/16 to 16/8. Thank you!
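
In case it helps anyone diagnose this: a minimal half-precision matmul exercises roughly the same fp16 GEMM path the traceback points at (`cublasGemmEx` with `CUDA_R_16F` inputs). This is just a standalone sketch using plain PyTorch, not anything from the TTS repo; if it also fails, the problem is probably in the CUDA/cuBLAS/driver setup rather than the training script, and if it passes, memory pressure during training is the more likely culprit.

```python
# Standalone sketch (not from the TTS code): run a small fp16 matmul on the GPU.
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

a = torch.randn(1024, 1024, dtype=torch.float16, device=device)
b = torch.randn(1024, 1024, dtype=torch.float16, device=device)

c = a @ b                  # half-precision GEMM, the same kind of call cublasGemmEx handles
torch.cuda.synchronize()   # force the kernel to actually execute before we declare success
print("fp16 matmul OK:", c.shape, c.dtype)
```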

Edit: typo

With your combination of OS (most people here use Ubuntu/Debian-style Linux), Anaconda (plain Python plus pip packages is recommended), and a "low end" GPU with only 4 GB of VRAM, you might be out of luck. Googling the CUBLAS error message suggests the GPU memory is the main issue. You can try reducing the batch size even further, but in the end I doubt you will get good training results on this setup…
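
If you want to confirm whether the 4 GB card is actually the bottleneck, you can watch GPU memory use while the first batches run. A hedged sketch using standard `torch.cuda` calls (nothing TTS-specific; the placement next to `backward()` is just an example):

```python
# Sketch: log CUDA memory stats, e.g. right before the backward() call in the
# training loop, to see how close the 4 GB card is to its limit.
import torch

def log_cuda_memory(tag=""):
    # memory_allocated: memory held by live tensors; memory_reserved: memory cached by the allocator
    alloc = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    total = torch.cuda.get_device_properties(0).total_memory / 1024**2
    print(f"[{tag}] allocated={alloc:.0f} MiB  reserved={reserved:.0f} MiB  total={total:.0f} MiB")

# Hypothetical placement inside the training step:
# log_cuda_memory("before backward")
# scaler.scale(loss_dict['loss']).backward()
```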