Odd memory error 450 epochs in

Hard for me to guess how this is happening: training goes fine for about 450 epochs and then I get an out-of-memory error. Just keeping this here for tracking; I'll try turning off gradual training on the next run. Perhaps there is an error in the batch size calculation somehow?

Epoch 450/1000
| > Step:7/188 GlobalStep:72650 PostnetLoss:0.03720 DecoderLoss:0.01557 StopLoss:0.42604 GALoss:0.01249 GradNorm:0.48254 GradNormST:0.29868 AvgTextLen:13.4 AvgSpecLen:102.2 StepTime:0.83 LoaderTime:0.00 LR:0.000100
| > Step:32/188 GlobalStep:72675 PostnetLoss:0.01316 DecoderLoss:0.01272 StopLoss:0.39781 GALoss:0.00317 GradNorm:0.04753 GradNormST:0.06469 AvgTextLen:28.3 AvgSpecLen:158.0 StepTime:0.75 LoaderTime:0.00 LR:0.000100
| > Step:57/188 GlobalStep:72700 PostnetLoss:0.01175 DecoderLoss:0.01227 StopLoss:0.43638 GALoss:0.00153 GradNorm:0.04309 GradNormST:0.04651 AvgTextLen:40.9 AvgSpecLen:228.8 StepTime:0.99 LoaderTime:0.00 LR:0.000100
| > Step:82/188 GlobalStep:72725 PostnetLoss:0.01112 DecoderLoss:0.01198 StopLoss:0.36987 GALoss:0.00122 GradNorm:0.03563 GradNormST:0.03267 AvgTextLen:54.7 AvgSpecLen:308.2 StepTime:1.30 LoaderTime:0.00 LR:0.000100
| > Step:107/188 GlobalStep:72750 PostnetLoss:0.01100 DecoderLoss:0.01211 StopLoss:0.26786 GALoss:0.00071 GradNorm:0.01053 GradNormST:0.01795 AvgTextLen:71.4 AvgSpecLen:381.3 StepTime:1.75 LoaderTime:0.00 LR:0.000100
| > Step:132/188 GlobalStep:72775 PostnetLoss:0.01030 DecoderLoss:0.01113 StopLoss:0.15845 GALoss:0.00060 GradNorm:0.19383 GradNormST:0.01814 AvgTextLen:90.8 AvgSpecLen:499.3 StepTime:2.11 LoaderTime:0.01 LR:0.000100
Traceback (most recent call last):
File "train.py", line 703, in <module>
main(args)
File "train.py", line 617, in main
global_step, epoch)
File "train.py", line 170, in train
text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/users/myadav/venvs/sri_tts/TTS/models/tacotron2.py", line 76, in forward
postnet_outputs = self.postnet(decoder_outputs)
File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/users/myadav/venvs/sri_tts/TTS/layers/tacotron2.py", line 45, in forward
x = layer(x)
File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/users/myadav/venvs/sri_tts/TTS/layers/tacotron2.py", line 27, in forward
output = self.net(x)
File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 202, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 2; 10.76 GiB total capacity; 9.51 GiB already allocated; 65.12 MiB free; 269.19 MiB cached)
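For what it's worth, one way to tell a genuine leak apart from an unlucky long batch is to log the CUDA allocator stats every so often and see whether the allocated number creeps up across epochs rather than spiking on a single step. A minimal sketch, assuming a plain PyTorch training loop (the helper name and the logging interval are made up):

```python
import torch

def log_gpu_mem(step, device=0):
    # Hypothetical helper: print allocator stats; a slow leak shows up as
    # "alloc" growing monotonically across epochs, while an unlucky long
    # batch only shows up as a one-off jump in "peak".
    alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print(f"step {step}: alloc={alloc:.0f} MiB, peak={peak:.0f} MiB")

# e.g. inside the training loop (placement is an assumption):
# if global_step % 100 == 0:
#     log_gpu_mem(global_step)
```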

Do you have TensorBoard running in parallel? I have the impression that TB sometimes leaks memory.

I didn't see the OOM error myself, even at 185k steps, but my training machine has 32 GB of memory.

I can't be 100% sure it was off while this was running, but I tend to keep it off out of worry about side effects (data sitting on NFS, etc.). I'll watch that.

TensorBoard was not on, and I get the same error just 12 epochs in when I restart from the last checkpoint. At least it seems reproducible… I'm having a hard time understanding where the problem could be, given that I can get through 450 epochs fine.

Did you try reducing the batch size? With the default gradual training settings you start with batch size 64, which drops to 32 at step 50k, and r keeps decreasing at steps 130k and 290k.

I had the default gradual training params in the template config:

"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceed.

I suppose it doesn't make sense to turn gradual training off, since I start at batch size 64 and it only goes down from there… which should not result in running out of memory.
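For reference, each schedule entry is [first_step, r, batch_size], and the entry in effect is the last one whose first_step has been reached; with these defaults the OOM at global step ~72k falls in the [50000, 3, 32] regime. An illustrative sketch of that lookup (not the project's actual code):

```python
# Default schedule quoted above, in [first_step, r, batch_size] form.
GRADUAL_TRAINING = [[0, 7, 64], [1, 5, 64], [50000, 3, 32],
                    [130000, 2, 32], [290000, 1, 32]]

def current_schedule(global_step, schedule=GRADUAL_TRAINING):
    """Return (r, batch_size) from the last entry whose first_step <= global_step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for first_step, new_r, new_bs in schedule:
        if global_step >= first_step:
            r, batch_size = new_r, new_bs
    return r, batch_size

print(current_schedule(72650))   # -> (3, 32): regime of the 450-epoch OOM above
print(current_schedule(290025))  # -> (1, 32): regime of the later OOM below
```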

I've just reduced the size of my dataset (a subset of LibriTTS), and I'll see what happens when the batch size is kept constant ("gradual_training": null), just to keep things controlled. I presume there has to be a leak somewhere…

Are you using the latest TTS?

I am using the dev branch on a Mar 10 commit (2a15e391669f9073ba10ef7ff20bb54ec5246977)

Reducing the dataset size and turning off gradual training seems to have gotten me further in terms of epochs so far.

Tacotron or Tacotron2? How large is your GPU mem?

Tacotron2 (using config_template.json with the fewest changes possible).

4 x RTX 2080 Ti (11 GB each)

Is your dataset LJSpeech?

LJSpeech trains fine with that config. This run was on a mix of a custom 3-hour dataset plus 100 random speakers from LibriTTS. I don't have the run saved, but I believe I can get the same error if I simply take a larger subset of LibriTTS (e.g. 400 speakers).

Probably LibriTTS has some files longer than LJSpeech's, and that breaks the memory after a while. Try shortening the max mel length in the config file.

Ah right, there is nothing stopping me from getting an unlucky batch with lots of longer sentences, which could trigger the memory error.
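The same effect can also be approximated by filtering long utterances out of the metadata up front, independent of whatever length cap the config exposes. A rough sketch; the threshold, hop length, and tuple layout below are assumptions, not values from this repo:

```python
# Drop utterances whose mel spectrogram would exceed a frame budget, so a
# batch full of long LibriTTS clips cannot blow past GPU memory.
MAX_MEL_FRAMES = 500   # assumed budget; tune to your GPU
HOP_LENGTH = 256       # assumed STFT hop from the audio config

def mel_frames(num_audio_samples):
    # Approximate number of mel frames a clip produces, given the STFT hop.
    return num_audio_samples // HOP_LENGTH

def keep(item):
    # item = (wav_path, num_audio_samples, transcript) -- hypothetical layout
    return mel_frames(item[1]) <= MAX_MEL_FRAMES

# dataset = [it for it in dataset if keep(it)]
```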

I have had a similar experience, but much later in training: I train Tacotron 2 with the default config settings, i.e. with gradual training activated. My system is certainly rather undersized (RTX 2080 Super, 8 GB), but apart from the long execution time, it has been running without problems so far. After 290,000 iterations, i.e. after switching from [130000, 2, 32] to [290000, 1, 32], I now receive a memory error.

 > Epoch 952/1000
| > Step:0/357  GlobalStep:290025  PostnetLoss:0.23366  DecoderLoss:0.28552  StopLoss:0.81694  AlignScore:0.2244  GradNorm:1.07657  GradNormST:0.60316  AvgTextLen:24.8  AvgSpecLen:130.9  StepTime:0.65  LoaderTime:0.52  LR:0.000100
| > Step:25/357  GlobalStep:290050  PostnetLoss:0.12785  DecoderLoss:0.15022  StopLoss:0.40783  AlignScore:0.4268  GradNorm:0.16539  GradNormST:0.36246  AvgTextLen:44.6  AvgSpecLen:266.3  StepTime:0.95  LoaderTime:0.01  LR:0.000100
| > Step:50/357  GlobalStep:290075  PostnetLoss:0.11792  DecoderLoss:0.13926  StopLoss:0.28328  AlignScore:0.5682  GradNorm:0.14249  GradNormST:0.16790  AvgTextLen:56.7  AvgSpecLen:346.1  StepTime:1.12  LoaderTime:0.01  LR:0.000100
! Run is kept in outputs/ljspeech-stft_params-May-07-2020_10+22AM-2e2221f
Traceback (most recent call last):
File "train.py", line 724, in <module>
main(args)
File "train.py", line 640, in main
global_step, epoch)
File "train.py", line 167, in train
text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
File "/home/**/tmp/tts-venv/lib/python3.6/site-packages/torch-1.5.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/**/workspace/TTS/tts_namespace/TTS/models/tacotron2.py", line 75, in forward
encoder_outputs, mel_specs, mask)
File "/home/**/tmp/tts-venv/lib/python3.6/site-packages/torch-1.5.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/**/workspace/TTS/tts_namespace/TTS/layers/tacotron2.py", line 265, in forward
decoder_output, attention_weights, stop_token = self.decode(memory)
File "/home/**/workspace/TTS/tts_namespace/TTS/layers/tacotron2.py", line 229, in decode
decoder_rnn_input = torch.cat((self.query, self.context), -1)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 7.79 GiB total capacity; 6.39 GiB already allocated; 6.44 MiB free; 6.48 GiB reserved in total by PyTorch)

The logic behind this is not quite clear to me: shouldn't the memory requirements decrease with r=1?
I am now continuing the training with [290000, 1, 16], and that seems to work.
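One possible explanation, as a hedged back-of-envelope rather than a measurement: the decoder emits r frames per step, so going from r=2 to r=1 roughly doubles the number of decoder iterations per utterance, and autograd keeps the activations of every extra step for the backward pass, which can outweigh the unchanged batch size. Halving the batch size, as done above, roughly restores the previous footprint:

```python
# Assumed proportionality: decoder memory ~ batch_size * (mel_len / r),
# since each of the mel_len / r decoder steps stores activations for backprop.
def relative_decoder_memory(batch_size, mel_len, r):
    return batch_size * (mel_len / r)

before = relative_decoder_memory(32, 346, 2)   # [130000, 2, 32], AvgSpecLen ~346 from the log
after  = relative_decoder_memory(32, 346, 1)   # [290000, 1, 32]
halved = relative_decoder_memory(16, 346, 1)   # [290000, 1, 16]

print(after / before)   # 2.0 -> twice the decoder steps to hold in memory
print(halved / before)  # 1.0 -> batch size 16 roughly restores the old footprint
```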