GPU not used while training Mozilla TTS

Hi,
I am using the command below to train Mozilla TTS on the GPU.
NOTE: I have a single GPU.
CUDA_VISIBLE_DEVICES="0" python /home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py --config_path /home/ubuntu/drive_a/mayank/TTS/TTS/tts/configs/config.json

But I am getting the error below:
Traceback (most recent call last):
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py", line 69, in <module>
    main()
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py", line 46, in main
    command = [os.path.join(folder_path, args.script)]
  File "/home/ubuntu/.conda/envs/mayank_tts/lib/python3.6/posixpath.py", line 94, in join
    genericpath._check_arg_types('join', a, *p)
  File "/home/ubuntu/.conda/envs/mayank_tts/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'

Hi, is CUDA installed?

Yes, CUDA is installed:
NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.1

Any reason you are using distribute.py, since you only have 1 GPU?

If I use the command below:
python /home/ubuntu/drive_a/mayank/TTS/TTS/bin/train_tts.py --config_path /home/ubuntu/drive_a/mayank/TTS/TTS/tts/configs/config.json

in logs, I get:
Using CUDA: False

Number of GPUs: 0

Then it looks like a general problem. You can try following this guide to reinstall CUDA if you like: https://medium.com/@exesse/cuda-10-1-installation-on-ubuntu-18-04-lts-d04f89287130

But first check with nvidia-smi whether the GPU shows up.

then there is something wrong with your configuration/installation

what is your pytorch version?

Wed Nov 11 21:43:08 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:17:00.0 Off |                  N/A |
| 24%   44C    P0    42W / 215W |      0MiB /  7982MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    Off  | 00000000:65:00.0 Off |                  N/A |
| 25%   52C    P0    N/A /  N/A |      0MiB /  7979MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@sanjaesc, PyTorch version:

import torch
print(torch.__version__)
1.5.0

So you have 2 GPUs?

Did you try running

CUDA_VISIBLE_DEVICES="0" python /home/ubuntu/drive_a/mayank/TTS/TTS/bin/train_tts.py --config_path /home/ubuntu/drive_a/mayank/TTS/TTS/tts/configs/config.json

torch.cuda.is_available() returns False, I guess.

Yes, you are correct:
torch.cuda.is_available() returns False.
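
When torch.cuda.is_available() returns False even though nvidia-smi lists the GPUs, it can help to compare the CUDA version PyTorch was built against with the one the driver reports. The following is only a rough diagnostic sketch; the helper name cuda_report is made up for illustration. A built_with_cuda value of None would indicate a CPU-only PyTorch build, and a value newer than the "CUDA Version" shown by nvidia-smi can point to a driver that is too old for that build.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup in the current environment."""
    # Avoid an ImportError so the report also works where PyTorch is missing.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None means a CPU-only build.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Run it inside the same conda environment that is used for training, since a different environment may carry a different PyTorch build.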

Running
CUDA_VISIBLE_DEVICES="0" python /home/ubuntu/drive_a/mayank/TTS/TTS/bin/train_tts.py --config_path /home/ubuntu/drive_a/mayank/TTS/TTS/tts/configs/config.json

throws the error:
Traceback (most recent call last):
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py", line 69, in <module>
    main()
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py", line 46, in main
    command = [os.path.join(folder_path, args.script)]
  File "/home/ubuntu/.conda/envs/mayank_tts/lib/python3.6/posixpath.py", line 94, in join
    genericpath._check_arg_types('join', a, *p)
  File "/home/ubuntu/.conda/envs/mayank_tts/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'

Why do you get an error for distribute.py when you run train_tts.py?

But there seems to be something wrong with your CUDA installation.

Thanks @georroussos @sanjaesc for your help. I will fix the CUDA installation issues…

After fixing the CUDA installation issue…

Command run:
python /home/ubuntu/drive_a/mayank/TTS/TTS/bin/train_tts.py --config_path /home/ubuntu/drive_a/mayank/TTS/TTS/tts/configs/config.json

Console prints:
Using CUDA: True

Number of GPUs: 2

but the screen seems to freeze forever and no epoch execution can be seen:

(mayank_tts) ubuntu@ubuntu-16:~$ python /home/ubuntu/drive_a/mayank/TTS/TTS/bin/train_tts.py --config_path /home/ubuntu/drive_a/mayank/TTS/TTS/tts/configs/config.json
/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

Using CUDA: True
Number of GPUs: 2
fatal: Not a git repository (or any of the parent directories): .git
Git Hash: 0000000
Experiment folder: /home/ubuntu/drive_a/tanmay/tts/s3_tts_trainingWaveFiles/mono/normalised/chunks/Dataset/ljspeech-ddc-November-12-2020_02+20PM-0000000
fatal: Not a git repository (or any of the parent directories): .git
Setting up Audio Processor…
| > sample_rate:22050
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:/home/ubuntu/drive_a/tanmay/tts/s3_tts_trainingWaveFiles/mono/normalised/chunks/Dataset/scale_stats.npy
| > hop_length:256
| > win_length:1024

Also checked:
(mayank_tts) ubuntu@ubuntu-16:~$ nvidia-smi
Thu Nov 12 14:40:04 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:17:00.0 Off |                  N/A |
| 24%   32C    P8    11W / 215W |      3MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    Off  | 00000000:65:00.0 Off |                  N/A |
| 25%   40C    P8    18W / 215W |      3MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(mayank_tts) ubuntu@ubuntu-16:~$ python
Python 3.6.12 |Anaconda, Inc.| (default, Sep 8 2020, 23:10:56)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.5.0
>>> torch.cuda.is_available()
True

It is probably because you have more than one GPU and you are running train_tts.py directly, so it gets stuck when it tries to spawn processes. Try selecting a specific GPU, for example: OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 python3 train_tts.py
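
CUDA_VISIBLE_DEVICES restricts which GPUs a process can see, so setting it to a single index makes a multi-GPU machine look like a single-GPU one to that process. If the training script is launched from Python rather than from the shell, the same effect can be sketched as below. The helper names single_gpu_env and run_on_single_gpu are hypothetical, and the child command here only echoes the variable instead of starting a real training run.

```python
import os
import subprocess
import sys


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_single_gpu(command, gpu_index=0):
    """Run `command` (a list of strings) with only `gpu_index` visible to CUDA."""
    return subprocess.run(command, env=single_gpu_env(gpu_index), check=False)


if __name__ == "__main__":
    # The child prints the value it sees; a real run would launch train_tts.py instead.
    run_on_single_gpu(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        gpu_index=0,
    )
```

The variable has to be set before the process initialises CUDA, which is why it is passed through the child's environment rather than changed after the fact.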

Running fine, thanks…

I would advise against OMP_NUM_THREADS=1 if the CPU is not a bottleneck.

Command Used:
OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0 python3 train_tts.py

but I am getting this error:
Traceback (most recent call last):
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/train_tts.py", line 715, in <module>
    main(args)
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/train_tts.py", line 627, in main
    global_step, epoch, amp, speaker_mapping)
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/train_tts.py", line 163, in train
    text_input, text_lengths, mel_input, mel_lengths, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/tts/models/tacotron2.py", line 138, in forward
    decoder_outputs_backward, alignments_backward = self._coarse_decoder_pass(mel_specs, encoder_outputs, alignments, input_mask)
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/tts/models/tacotron_abstract.py", line 158, in _coarse_decoder_pass
    encoder_outputs.detach(), mel_specs, input_mask)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/tts/layers/tacotron2.py", line 326, in forward
    decoder_output, attention_weights, stop_token = self.decode(memory)
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/tts/layers/tacotron2.py", line 287, in decode
    dim=1)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 7.80 GiB total capacity; 6.93 GiB already allocated; 2.31 MiB free; 6.98 GiB reserved in total by PyTorch)

As suggested in other threads, I reduced batch_size to 16 (from 32), but the same error still occurs.

Since I have 2 GPUs, I tried the command below:
CUDA_VISIBLE_DEVICES=0,1 python /home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py --config_path /home/ubuntu/drive_a/mayank/TTS/TTS/tts/configs/config.json

but I am getting this error:
Traceback (most recent call last):
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py", line 69, in <module>
    main()
  File "/home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py", line 46, in main
    command = [os.path.join(folder_path, args.script)]
  File "/home/ubuntu/.conda/envs/mayank_tts/lib/python3.6/posixpath.py", line 94, in join
    genericpath._check_arg_types('join', a, *p)
  File "/home/ubuntu/.conda/envs/mayank_tts/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'

I also tried:
CUDA_VISIBLE_DEVICES='0,1' python /home/ubuntu/drive_a/mayank/TTS/TTS/bin/distribute.py --config_path /home/ubuntu/drive_a/mayank/TTS/TTS/tts/configs/config.json

Same error as above.
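
The traceback shows os.path.join receiving None for args.script, which suggests the launcher expected a script argument that was never passed; quoting CUDA_VISIBLE_DEVICES differently would not change that. The sketch below reproduces the failure with a hypothetical parser. The option name --script is only inferred from args.script in the traceback, so it is worth checking python distribute.py --help for the actual flag in your checkout.

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Hypothetical stand-in for the option behind `args.script` in the traceback.
parser.add_argument("--script", type=str, default=None)
parser.add_argument("--config_path", type=str)


def build_command(argv, folder_path="/tmp"):
    """Build the child command, failing early with a readable message."""
    args = parser.parse_args(argv)
    if args.script is None:
        # Without this check, os.path.join(folder_path, None) raises
        # "TypeError: join() argument must be str or bytes, not 'NoneType'".
        raise ValueError("no training script given, e.g. --script train_tts.py")
    return [os.path.join(folder_path, args.script), "--config_path", args.config_path]


if __name__ == "__main__":
    print(build_command(["--script", "train_tts.py", "--config_path", "config.json"]))
    try:
        build_command(["--config_path", "config.json"])
    except ValueError as err:
        print("error:", err)
```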

Try to shorten your sequence length.
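
The reasoning is that memory use is driven by the longest sequence in a batch, so a handful of very long utterances can exhaust GPU memory even after the batch size has been halved. A minimal sketch of capping sequence length follows. The sample data and the names filter_by_length and max_seq_len are illustrative assumptions, and whether your config.json exposes a comparable setting should be checked against your TTS version.

```python
def filter_by_length(samples, max_seq_len):
    """Keep only (text, wav_path) pairs whose text has at most max_seq_len characters."""
    return [(text, wav) for text, wav in samples if len(text) <= max_seq_len]


if __name__ == "__main__":
    # Hypothetical metadata entries: (transcript, audio file).
    samples = [
        ("Short sentence.", "a.wav"),
        ("A much longer sentence that would dominate the padded batch size.", "b.wav"),
    ]
    kept = filter_by_length(samples, max_seq_len=30)
    print(f"kept {len(kept)} of {len(samples)} samples")
```

Lowering the cap step by step until the out-of-memory error disappears gives a rough idea of how much of the dataset has to be dropped.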