Trouble Continuing Training

quavaro · October 11, 2021, 3:00pm

Hi everyone, I am a novice and have been carefully following the documentation and reading through the issues here for help. Finally, I was able to successfully train on a small dataset for 1000 epochs. The test sentences seem to be heading in the right direction, albeit slowly, so I added more audio, updated the metadata, regenerated scale_stats.npy, and trained again.

I first tried the restore_path command, and it worked for 771 epochs and then hung indefinitely on a training step. I stopped the training (is there a prescribed way of doing this? I just ctrl+C’d) and then tried continue_path on the newly created folder. That worked for 309 epochs and then hung again at this step:

 > EPOCH: 309/1000

 > Number of output frames: 5

 > TRAINING (2021-10-11 09:27:17)

The other training steps take seconds, but this one sat like that for half an hour, so something was clearly wrong, yet there were no error messages. I’ve repeated this 3 or 4 times with the same result: after a few hundred epochs, it just hangs on a training step. I am not sure what to do, as there is no error message. I only saw an occasional “warning: audio amplitude out of range, auto clipped” message. I am training with a 12GB GPU and 32GB of RAM. I didn’t check VRAM (I assumed it was all in use); RAM was still under 50% in use.
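Since the hang produces no error message, one way to see where the process is stuck might be Python's standard faulthandler module, which can print the stack of every thread without stopping the run. Below is a minimal sketch, assuming it is called once near the top of the training script. The function name enable_hang_diagnostics and the 10-minute interval are my own placeholders, not anything from the TTS repository.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=600, stream=None):
    """Periodically dump all thread stacks, and also on SIGUSR1 where supported.

    Returns True if the SIGUSR1 handler was installed, False otherwise.
    The name and defaults are illustrative assumptions.
    """
    if stream is None:
        stream = sys.stderr

    # Every timeout_s seconds, write the traceback of every thread to stream.
    # If training is stuck, consecutive dumps will show the same frames.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)

    # On Unix-like systems, also allow an on-demand dump from another
    # terminal with: kill -USR1 <pid>
    if hasattr(signal, "SIGUSR1") and hasattr(faulthandler, "register"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
        return True
    return False
```

If several dumps in a row point at the same line (for example inside a data loader worker or an audio loading call), that would at least narrow down which step is hanging.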

For my config, I’ve made only the bare minimum adjustments to the ljspeech config.json that was in the repository. Does anyone have any suggestions on what to adjust to avoid the sudden stop in training? Lower batch size? Adjust the loudness of audio?
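On the loudness question, the clipping warning may or may not be related to the hang, but it should be possible to find out which clips trigger it. Here is a rough sketch, assuming the dataset is 16-bit PCM WAV files in a single folder. The function names and the 0.99 threshold are hypothetical choices for illustration, and only the standard library is used.

```python
import array
import sys
import wave
from pathlib import Path


def peak_amplitude(wav_path):
    """Return the peak absolute sample of a 16-bit PCM WAV, scaled to 0.0-1.0."""
    with wave.open(str(wav_path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError(f"{wav_path}: expected 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / 32768.0


def find_hot_files(wav_dir, threshold=0.99):
    """List (file name, peak) for clips whose peak reaches the threshold."""
    hot = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        peak = peak_amplitude(path)
        if peak >= threshold:
            hot.append((path.name, round(peak, 3)))
    return hot
```

Calling find_hot_files on the wavs folder would list the clips that sit at or near full scale, which are the likely candidates for normalizing or re-recording.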