Training suddenly dropping in quality

I’ve been trying to fine-tune the LJSpeech model (from the Tacotron-iter-260k branch) on a dataset of about 8 hours with a single male speaker. The dataset is good quality: it is recorded at the sample rate the config expects, it is clean (no applause or other background noise), and it doesn’t have long pauses between sentences (at most 1 second).
After about 14 hours of fine-tuning, the model’s quality suddenly dropped dramatically (see the alignment charts).

At around 1AM:

Afterwards:

I’m not exactly sure what happened or what would cause this; perhaps overfitting?
I’ve tried synthesizing some sentences using the last “good” checkpoint, but it can only manage a few short words, not a full sentence.
I’m very new to this thing so apologies if I’ve missed something obvious.
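
A minimal sketch of the kind of pre-training sanity check that helps confirm the clips really match the config; the directory layout and the 12-second warning threshold are my own assumptions, not details from this post:

import glob
import soundfile as sf

EXPECTED_SR = 22050  # must match "sample_rate" in config.json

for path in sorted(glob.glob("my_data/wavs/*.wav")):
    info = sf.info(path)
    duration = info.frames / info.samplerate
    if info.samplerate != EXPECTED_SR:
        print(f"{path}: sample rate {info.samplerate} != {EXPECTED_SR}, resample before training")
    if duration > 12.0:
        print(f"{path}: {duration:.1f}s is long; very long clips make attention harder to learn")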

Hi, does your license allow you to post dataset samples? What is r in the config file? Maybe you can try training with a higher r: if it is 2, try 3. 8 hours should be enough for transfer learning.

Hey, thanks for the reply!
Sadly I’m not allowed to post dataset samples, but let’s just say they’re nowhere close to the LJSpeech dataset from the example notebooks; they’re very “robotic”.
The r value is 1, I’ll try with 3 instead and see if it improves things, thank you!

For reference, the config.json I used is:

{
    "github_branch": "dev-tacotron2",
    "restore_path": "…/tts_models/best_model.pth.tar",
    "run_name": "my run",
    "run_description": "finetune my run",
    "audio": {
        // Audio processing parameters
        "num_mels": 80, // size of the mel spec frame.
        "num_freq": 1025, // number of stft frequency levels. Size of the linear spectrogram frame.
        "sample_rate": 22050, // wav sample rate. If different than the original data, it is resampled.
        "frame_length_ms": 50, // stft window length in ms.
        "frame_shift_ms": 12.5, // stft window hop-length in ms.
        "preemphasis": 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "min_level_db": -100, // normalization range
        "ref_level_db": 20, // reference level db, theoretically 20 db is the sound of air.
        "power": 1.5, // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60, // #griffin-lim iterations. 30-60 is a good range. The larger the value, the slower the generation.
        // Normalization parameters
        "signal_norm": true, // normalize the spec values in range [0, 1]
        "symmetric_norm": false, // move normalization to range [-1, 1]
        "max_norm": 1, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true, // clip normalized values into the range.
        "mel_fmin": 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
        "do_trim_silence": true // enable trimming of silence of audio as you load it. LJSpeech (false), TWEB (false), Nancy (true)
    },
    "distributed": {
        "backend": "nccl",
        "url": "tcp://localhost:54321"
    },
    "reinit_layers": [], // set which layers to be reinitialized in finetuning. Only used if --restore_model is provided.
    "model": "Tacotron2", // one of the models in models/
    "grad_clip": 1, // upper limit for gradients for clipping.
    "epochs": 1000, // total number of epochs to train.
    "lr": 0.0001, // Initial learning rate. If Noam decay is active, maximum learning rate.
    "lr_decay": false, // if true, Noam learning rate decaying is applied through training.
    "warmup_steps": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"
    "windowing": false, // Enables attention windowing. Used only in eval mode.
    "memory_size": 5, // ONLY TACOTRON - memory queue size used to queue network predictions to feed autoregressive connection. Useful if r < 5.
    "attention_norm": "softmax", // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
    "prenet_type": "bn", // ONLY TACOTRON2 - "original" or "bn".
    "use_forward_attn": true, // ONLY TACOTRON2 - if it uses forward attention. In general, it aligns faster.
    "transition_agent": false, // ONLY TACOTRON2 - enable/disable transition agent of forward attention.
    "loss_masking": false, // enable/disable loss masking against the sequence padding.
    "enable_eos_bos_chars": true, // enable/disable beginning-of-sentence and end-of-sentence chars.
    "batch_size": 16, // Batch size for training. Values lower than 32 might make attention hard to learn.
    "eval_batch_size": 16,
    "r": 1, // Number of frames to predict per step.
    "wd": 0.000001, // Weight decay weight.
    "checkpoint": true, // If true, it saves checkpoints per "save_step"
    "save_step": 1000, // Number of training steps between saving training stats and checkpoints.
    "print_step": 100, // Number of steps between logging training on the console.
    "tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
    "batch_group_size": 8, // Number of batches to shuffle after bucketing.
    "run_eval": true,
    "test_delay_epochs": 2, // Until attention is aligned, testing only wastes computation time.
    "data_path": "…/my_data", // DATASET-RELATED: can be overwritten from command argument
    "meta_file_train": "metadata_train.csv", // DATASET-RELATED: metafile for training dataloader.
    "meta_file_val": "metadata_val.csv", // DATASET-RELATED: metafile for evaluation dataloader.
    "dataset": "ljspeech", // DATASET-RELATED: one of TTS.dataset.preprocessors depending on your target dataset. Use "tts_cache" for a dataset pre-computed by extract_features.py
    "min_seq_len": 0, // DATASET-RELATED: minimum text length to use in training
    "max_seq_len": 240, // DATASET-RELATED: maximum text length
    "output_path": "…/tts_models/outputs", // DATASET-RELATED: output path for all training outputs.
    "num_loader_workers": 8, // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4, // number of evaluation data loader processes.
    "phoneme_cache_path": "ljspeech_phonemes", // phoneme computation is slow, therefore it caches results in the given folder.
    "use_phonemes": true, // use phonemes instead of raw characters. It is suggested for better pronunciation.
    "phoneme_language": "en-us", // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages
    "text_cleaner": "phoneme_cleaners"
}
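
Note that the // comments make this file invalid as plain JSON; the trainer's config loader is expected to handle them. A rough standalone reader for inspecting such a config outside the trainer (the comment-stripping regex is a naive approximation of my own, not the project's loader):

import json
import re

def load_commented_json(path):
    # Naive comment stripping: removes "//" to end of line unless the "//" is
    # preceded by ":", so values like "tcp://localhost:54321" survive.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    text = re.sub(r"(?<!:)//.*", "", text)
    return json.loads(text)

cfg = load_commented_json("config.json")
print("r =", cfg["r"], "| sample_rate =", cfg["audio"]["sample_rate"], "| prenet_type =", cfg["prenet_type"])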

The problem is most likely here: if some of the samples have silences of 1 second or more, the attention will certainly break. Try setting r = 2 and it should be fine.
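
For reference, a small script along these lines can flag clips with internal pauses of a second or more (the top_db threshold and paths are assumptions; tune them for the dataset):

import glob
import librosa

MAX_PAUSE_S = 1.0

for path in sorted(glob.glob("my_data/wavs/*.wav")):
    y, sr = librosa.load(path, sr=None)
    # Non-silent [start, end] sample indices; the gaps between them are pauses.
    intervals = librosa.effects.split(y, top_db=40)
    for (_, prev_end), (next_start, _) in zip(intervals[:-1], intervals[1:]):
        pause = (next_start - prev_end) / sr
        if pause >= MAX_PAUSE_S:
            print(f"{path}: {pause:.2f}s pause starting at {prev_end / sr:.2f}s")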


I’ve tried training again with r = 2 and it seems much worse than the previous run with r = 1. The alignment charts are empty all the way through, and the synthesized test audios are basically just white noise. It’s currently at epoch 79/1000 after training for ~16 hours.

Try with the “original” prenet type instead of “bn”.

No dice, same result: the test audios are just noise and the alignment charts are blank. I’ll try going back to r = 1 since that at least gave some results.

Are you sure your dataset is good quality? Is it studio-recorded? It seems very weird that you are not getting results.

What is the language?

I’m pretty sure it is good quality. It’s not studio-recorded, but it’s a set of speeches captured with a good-quality microphone and no background noise. The recordings are split on silence and then merged back together into 7-to-12 second segments.
The segments are transcribed with Google’s Cloud Speech-to-Text tool, which as far as I can tell has minimal errors.

The language is American English.
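
A rough sketch of that kind of split-and-merge segmentation, using pydub (the thresholds, file names, and 22050 Hz export are my own assumptions, not the poster's actual settings):

from pydub import AudioSegment
from pydub.silence import split_on_silence

speech = AudioSegment.from_wav("speech.wav")
chunks = split_on_silence(speech, min_silence_len=500, silence_thresh=-40, keep_silence=200)

# Greedily merge chunks back into segments of roughly 7-12 seconds (pydub lengths are in ms).
segments, current = [], AudioSegment.empty()
for chunk in chunks:
    if len(current) + len(chunk) > 12_000:
        if len(current) >= 7_000:
            segments.append(current)
        current = AudioSegment.empty()
    current += chunk
if len(current) >= 7_000:
    segments.append(current)

for i, seg in enumerate(segments):
    seg.set_frame_rate(22050).export(f"segment_{i:04d}.wav", format="wav")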

Here is the initial output of the training in case it provides more info:

Setting up Audio Processor…
| > bits:None
| > sample_rate:22050
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:12.5
| > frame_length_ms:50
| > ref_level_db:20
| > num_freq:1025
| > power:1.5
| > preemphasis:0.98
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:False
| > mel_fmin:0.0
| > mel_fmax:8000.0
| > max_norm:1.0
| > clip_norm:True
| > do_trim_silence:True
| > n_fft:2048
| > hop_length:275
| > win_length:1102
Using model: Tacotron2
| > Num output units : 1025
Model restored from step 261000

Model has 28151842 parameters

DataLoader initialization
| > Data path: …/tts_models/my_data
| > Use phonemes: True
| > phoneme language: en-us
| > Cached dataset: False
| > Number of instances : 2395
| > Max length sequence: 234
| > Min length sequence: 17
| > Avg length sequence: 133.70313152400834
| > Num. instances discarded by max-min seq limits: 0
| > Batch group size: 256.
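
As an aside, the derived STFT sizes in that log follow from the audio section of the config. A quick check, assuming the usual relationships and simple truncation (the processor's exact rounding is an assumption on my part):

sample_rate = 22050
num_freq = 1025
frame_length_ms = 50
frame_shift_ms = 12.5

n_fft = (num_freq - 1) * 2                              # 2048
hop_length = int(frame_shift_ms / 1000 * sample_rate)   # int(275.625) -> 275
win_length = int(frame_length_ms / 1000 * sample_rate)  # int(1102.5)  -> 1102

print(n_fft, hop_length, win_length)  # 2048 275 1102, matching the log above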

I trained again for ~10 hours using r = 1 and prenet_type “bn”, and after a while the alignment is just gone again. At least the initial test audios contain actual speech, although the model can only get halfway through the sentence. Test audios seem to be better at the beginning of training (say, after 1500 steps or so) and then get progressively worse.

r = 2 and prenet_type “original” (tried both together and separately) gave me just static noise and no alignment at any epoch.

I’m at a loss, I guess I could try training Tacotron2 directly from NVIDIA’s repository.

I think that Tacotron-iter-260k was trained with an older version of TTS. It should be compatible with the new one, but it may not be (also, I do not know whether it is Taco or Taco2; in my case Taco2 works much better and aligns much more quickly). Otherwise, it is very weird that you are not getting results. I always restore from a pretrained model at 40k steps when I want to train a new TTS, and it never fails to align within 10k steps.

You can try an older version, or try this one: https://github.com/Edresson/TTS/tree/voice-cloning. Alternatively, you can try fine-tuning the model from https://github.com/mozilla/TTS/issues/345. The repo I am linking is the one I use and it always works. I have not tried the new TTS version.


Thank you, I’ll try both and report back. Which model do you suggest I restore from when using Edresson’s repo? Or should I train from scratch?

I have no idea, but the one I linked above has worked for me in the past :smiley: