It's hard for me to guess how this is happening: training runs fine for about 450 epochs and then I get an out-of-memory error. I'm keeping this here for tracking. I'll try turning off gradual training on the next run; perhaps there is an error in the batch size calculation somehow?
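For reference, my understanding (an assumption on my part, I haven't checked it against the exact TTS version I'm running) is that gradual training switches the reduction factor r and the batch size whenever the global step crosses a threshold in the `gradual_training` config list, so memory use can jump mid-run. A minimal sketch of that lookup, assuming entries of the form `[start_step, r, batch_size]` (the values below are illustrative, not my actual config):

```python
# Sketch: pick the (r, batch_size) pair that applies at a given global step,
# assuming a gradual_training schedule of [start_step, r, batch_size] entries.
gradual_training = [
    [0, 7, 64],
    [10000, 5, 64],
    [50000, 3, 32],
    [100000, 2, 32],
]

def schedule_for_step(global_step, schedule):
    """Return the last schedule entry whose start_step <= global_step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, new_r, new_batch_size in schedule:
        if global_step >= start_step:
            r, batch_size = new_r, new_batch_size
    return r, batch_size

print(schedule_for_step(72775, gradual_training))  # -> (3, 32) with the values above
```

If the real schedule drops r late in training, each batch decodes more frames per step, which would line up with an OOM only after hundreds of epochs.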
Epoch 450/1000
| > Step:7/188 GlobalStep:72650 PostnetLoss:0.03720 DecoderLoss:0.01557 StopLoss:0.42604 GALoss:0.01249 GradNorm:0.48254 GradNormST:0.29868 AvgTextLen:13.4 AvgSpecLen:102.2 StepTime:0.83 LoaderTime:0.00 LR:0.000100
| > Step:32/188 GlobalStep:72675 PostnetLoss:0.01316 DecoderLoss:0.01272 StopLoss:0.39781 GALoss:0.00317 GradNorm:0.04753 GradNormST:0.06469 AvgTextLen:28.3 AvgSpecLen:158.0 StepTime:0.75 LoaderTime:0.00 LR:0.000100
| > Step:57/188 GlobalStep:72700 PostnetLoss:0.01175 DecoderLoss:0.01227 StopLoss:0.43638 GALoss:0.00153 GradNorm:0.04309 GradNormST:0.04651 AvgTextLen:40.9 AvgSpecLen:228.8 StepTime:0.99 LoaderTime:0.00 LR:0.000100
| > Step:82/188 GlobalStep:72725 PostnetLoss:0.01112 DecoderLoss:0.01198 StopLoss:0.36987 GALoss:0.00122 GradNorm:0.03563 GradNormST:0.03267 AvgTextLen:54.7 AvgSpecLen:308.2 StepTime:1.30 LoaderTime:0.00 LR:0.000100
| > Step:107/188 GlobalStep:72750 PostnetLoss:0.01100 DecoderLoss:0.01211 StopLoss:0.26786 GALoss:0.00071 GradNorm:0.01053 GradNormST:0.01795 AvgTextLen:71.4 AvgSpecLen:381.3 StepTime:1.75 LoaderTime:0.00 LR:0.000100
| > Step:132/188 GlobalStep:72775 PostnetLoss:0.01030 DecoderLoss:0.01113 StopLoss:0.15845 GALoss:0.00060 GradNorm:0.19383 GradNormST:0.01814 AvgTextLen:90.8 AvgSpecLen:499.3 StepTime:2.11 LoaderTime:0.01 LR:0.000100
Traceback (most recent call last):
  File "train.py", line 703, in <module>
    main(args)
  File "train.py", line 617, in main
    global_step, epoch)
  File "train.py", line 170, in train
    text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/users/myadav/venvs/sri_tts/TTS/models/tacotron2.py", line 76, in forward
    postnet_outputs = self.postnet(decoder_outputs)
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/users/myadav/venvs/sri_tts/TTS/layers/tacotron2.py", line 45, in forward
    x = layer(x)
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/users/myadav/venvs/sri_tts/TTS/layers/tacotron2.py", line 27, in forward
    output = self.net(x)
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 2; 10.76 GiB total capacity; 9.51 GiB already allocated; 65.12 MiB free; 269.19 MiB cached)
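The log also shows AvgSpecLen climbing within the epoch, so the crash seems to hit one of the longest batches. Before the next run I want to log allocated/peak CUDA memory per step alongside AvgSpecLen to confirm that. This is just a debugging sketch of mine, not anything from the TTS code itself, and the loop variables in the comment are placeholders rather than the actual train.py names:

```python
import torch

def log_cuda_memory(step, device=None):
    """Print current and peak allocated CUDA memory in MiB for the given device."""
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print(f"step {step}: allocated={allocated:.1f} MiB, peak={peak:.1f} MiB")

# Hypothetical use inside the training loop (placeholder names):
# for step, batch in enumerate(loader):
#     try:
#         outputs = model(text_input, text_lengths, mel_input)
#     except RuntimeError as e:
#         if "out of memory" in str(e):
#             print("OOM on batch with mel shape", tuple(mel_input.shape))
#         raise
#     log_cuda_memory(step)
```

If the peak grows roughly with AvgSpecLen, that would point at the long-sequence batches (and gradual training's lower r) rather than a leak.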