I’ve been trying to fine-tune the LJSpeech model (from the Tacotron-iter-260k branch) on about 8 hours of audio from a single male speaker. The dataset is good quality: its sample rate matches the config, it is clean (no applause or other background noise), and it has no long pauses between sentences (at most 1 second).
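In case it helps anyone check the same things, here is a minimal sketch of how the sample rate and total duration of a dataset could be verified with Python's standard `wave` module. The function name `summarize_wavs`, the `wavs/` folder, and the 22050 Hz default (LJSpeech's native rate) are assumptions for illustration; the expected rate should be whatever the training config actually specifies.

```python
import wave
from pathlib import Path

# Assumption: 22050 Hz is LJSpeech's native rate; replace with the value in your config.
EXPECTED_SAMPLE_RATE = 22050


def summarize_wavs(wav_dir, expected_rate=EXPECTED_SAMPLE_RATE):
    """Return (total_hours, mismatched) for the PCM .wav files directly inside wav_dir.

    mismatched is a list of (file_name, sample_rate) for every file whose
    sample rate differs from expected_rate.
    """
    total_seconds = 0.0
    mismatched = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            # Duration of a clip is its frame count divided by its sample rate.
            total_seconds += wav.getnframes() / rate
        if rate != expected_rate:
            mismatched.append((path.name, rate))
    return total_seconds / 3600.0, mismatched
```

Calling `hours, bad = summarize_wavs("wavs/")` (a hypothetical folder name) should give roughly 8 hours and an empty `bad` list if the dataset is as described. Note that `wave` only reads uncompressed PCM files.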
After about 14 hours of fine-tuning, the model's quality suddenly dropped dramatically (see the alignment charts).
At around 1AM:
Afterwards:
I’m not sure what happened or what would cause this. Could it be overfitting?
I’ve tried to synthesize some sentences using the last “good” checkpoint, but it can only manage a few short words, not a full sentence.
I’m very new to this, so apologies if I’ve missed something obvious.