Training suddenly dropping in quality

I’ve been trying to fine-tune the LJSpeech model (from the Tacotron-iter-260k branch) on a dataset of about 8 hours with a single male speaker. The dataset is good quality, at the right sample rate for the config, clean (no applause or other noises in the background), and doesn’t have long pauses between sentences (at most 1 second).
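
For reference, a quick way to sanity-check claims like these is a sketch along the following lines, assuming the soundfile package; the dataset path is a placeholder:

import glob
import soundfile as sf

# Flag any wav whose sample rate doesn't match the 22050 Hz expected by the config,
# and any suspiciously long clip.
for path in sorted(glob.glob("my_data/wavs/*.wav")):
    info = sf.info(path)
    if info.samplerate != 22050:
        print(f"{path}: unexpected sample rate {info.samplerate}")
    if info.duration > 15.0:
        print(f"{path}: long clip ({info.duration:.1f}s)")
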
After about 14 hours of fine-tuning, the model suddenly dropped in quality dramatically (see the alignment charts).

At around 1 AM:

Afterwards:

I’m not exactly sure what happened or what would cause this; perhaps overfitting?
I’ve tried to synthesize some sentences using the last “good” checkpoint but it can only manage a few short words, not a full sentence.
I’m very new to this thing so apologies if I’ve missed something obvious.

Hi, does your license allow you to post dataset samples? What is the r value in the config file? Maybe you can try training with a higher r: if it is 2, try 3. 8 hours should be enough for transfer learning.

Hey, thanks for the reply!
Sadly I’m not allowed to post dataset samples, but let’s just say they’re nowhere close to the LJSpeech dataset from the example notebooks; they’re very “robotic”.
The r value is 1, I’ll try with 3 instead and see if it improves things, thank you!

For reference, the config.json I used is:

{
"github_branch":"dev-tacotron2",
"restore_path":"…/tts_models/best_model.pth.tar",
"run_name": "my run",
"run_description": "finetune my run",
"audio":{
// Audio processing parameters
"num_mels": 80, // size of the mel spec frame.
"num_freq": 1025, // number of stft frequency levels. Size of the linear spectrogram frame.
"sample_rate": 22050, // wav sample rate. If different from the original data, it is resampled.
"frame_length_ms": 50, // stft window length in ms.
"frame_shift_ms": 12.5, // stft window hop-length in ms.
"preemphasis": 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
"min_level_db": -100, // normalization range
"ref_level_db": 20, // reference level db, theoretically 20 dB is the sound of air.
"power": 1.5, // value to sharpen wav signals after GL algorithm.
"griffin_lim_iters": 60, // #griffin-lim iterations. 30-60 is a good range. The larger the value, the slower the generation.
// Normalization parameters
"signal_norm": true, // normalize the spec values in range [0, 1]
"symmetric_norm": false, // move normalization to range [-1, 1]
"max_norm": 1, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"mel_fmin": 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
"do_trim_silence": true // enable trimming of silence of audio as you load it. LJSpeech (false), TWEB (false), Nancy (true)
},
"distributed":{
"backend": "nccl",
"url": "tcp://localhost:54321"
},
"reinit_layers": [], // set which layers to be reinitialized in finetuning. Only used if --restore_model is provided.
"model": "Tacotron2", // one of the models in models/
"grad_clip": 1, // upper limit for gradients for clipping.
"epochs": 1000, // total number of epochs to train.
"lr": 0.0001, // Initial learning rate. If Noam decay is active, maximum learning rate.
"lr_decay": false, // if true, Noam learning rate decaying is applied through training.
"warmup_steps": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"
"windowing": false, // Enables attention windowing. Used only in eval mode.
"memory_size": 5, // ONLY TACOTRON - memory queue size used to queue network predictions to feed autoregressive connection. Useful if r < 5.
"attention_norm": "softmax", // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
"prenet_type": "bn", // ONLY TACOTRON2 - "original" or "bn".
"use_forward_attn": true, // ONLY TACOTRON2 - whether to use forward attention. In general, it aligns faster.
"transition_agent": false, // ONLY TACOTRON2 - enable/disable transition agent of forward attention.
"loss_masking": false, // enable/disable loss masking against the sequence padding.
"enable_eos_bos_chars": true, // enable/disable beginning-of-sentence and end-of-sentence chars.
"batch_size": 16, // Batch size for training. Values lower than 32 might make attention hard to learn.
"eval_batch_size": 16,
"r": 1, // Number of frames to predict per step.
"wd": 0.000001, // Weight decay weight.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"save_step": 1000, // Number of training steps expected to save training stats and checkpoints.
"print_step": 100, // Number of steps to log training on console.
"tb_model_param_stats": true, // if true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
"batch_group_size": 8, // Number of batches to shuffle after bucketing.
"run_eval": true,
"test_delay_epochs": 2, // Until attention is aligned, testing only wastes computation time.
"data_path": "…/my_data", // DATASET-RELATED: can be overwritten from command argument
"meta_file_train": "metadata_train.csv", // DATASET-RELATED: metafile for training dataloader.
"meta_file_val": "metadata_val.csv", // DATASET-RELATED: metafile for evaluation dataloader.
"dataset": "ljspeech", // DATASET-RELATED: one of TTS.dataset.preprocessors depending on your target dataset. Use "tts_cache" for a dataset pre-computed by extract_features.py
"min_seq_len": 0, // DATASET-RELATED: minimum text length to use in training
"max_seq_len": 240, // DATASET-RELATED: maximum text length
"output_path": "…/tts_models/outputs", // DATASET-RELATED: output path for all training outputs.
"num_loader_workers": 8, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"phoneme_cache_path": "ljspeech_phonemes", // phoneme computation is slow, therefore it caches results in the given folder.
"use_phonemes": true, // use phonemes instead of raw characters. It is suggested for better pronunciation.
"phoneme_language": "en-us", // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages
"text_cleaner": "phoneme_cleaners"
}

The problem is most likely here: if some of the samples have stretches of silence >= 1 second, the attention will certainly break. Try setting r = 2 and it should be fine.
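
To make that check concrete, a sketch like the following, assuming librosa, flags clips with internal pauses of a second or more (the top_db threshold and the path are guesses to tune per dataset):

import glob
import librosa

for path in sorted(glob.glob("my_data/wavs/*.wav")):  # placeholder path
    y, sr = librosa.load(path, sr=None)
    # Intervals of speech; anything quieter than top_db below the peak counts as silence.
    intervals = librosa.effects.split(y, top_db=40)
    for prev, nxt in zip(intervals[:-1], intervals[1:]):
        gap = (nxt[0] - prev[1]) / sr  # silence between consecutive speech intervals
        if gap >= 1.0:
            print(f"{path}: internal pause of {gap:.2f}s")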

I’ve tried training again with r = 2 and it seems much worse than the previous run with r = 1. The alignment charts are empty all the way through, and the synthesized test audios are basically just white noise. It’s currently at epoch 79/1000 after training for ~16 hours.

Try with the "original" prenet type instead of "bn".

No dice, same result: the test audios are just noise and the alignment charts are blank. I’ll try going back to r = 1 since that at least gave some results.

Are you sure your dataset is good quality? Is it studio-recorded? It seems very weird that you are not getting results.

What is the language?

I’m pretty sure it is good quality; it’s not studio-recorded, but it’s a collection of speeches recorded with a good-quality microphone and no background noise. The sentences are split on silence and then merged back together into 7-to-12-second segments.
The segments are transcribed using Google’s Cloud Speech-to-Text tool, which, as far as I could tell, makes minimal errors.
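
A split-and-merge pipeline of that shape could be sketched with pydub; the thresholds and paths below are illustrative, not the exact tooling used here:

from pydub import AudioSegment
from pydub.silence import split_on_silence

speech = AudioSegment.from_wav("speech.wav")  # placeholder input
chunks = split_on_silence(speech, min_silence_len=500, silence_thresh=-40, keep_silence=200)

# Greedily merge the chunks back into roughly 7-to-12-second segments
# (the final sub-7-second remainder is simply dropped in this sketch).
segment, index = AudioSegment.empty(), 0
for chunk in chunks:
    segment += chunk
    if len(segment) >= 7_000:  # pydub lengths are in milliseconds
        segment[:12_000].export(f"seg_{index:04d}.wav", format="wav")
        segment, index = AudioSegment.empty(), index + 1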

The language is American English.

Here is the initial output of the training in case it provides more info:

Setting up Audio Processor…
| > bits:None
| > sample_rate:22050
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:12.5
| > frame_length_ms:50
| > ref_level_db:20
| > num_freq:1025
| > power:1.5
| > preemphasis:0.98
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:False
| > mel_fmin:0.0
| > mel_fmax:8000.0
| > max_norm:1.0
| > clip_norm:True
| > do_trim_silence:True
| > n_fft:2048
| > hop_length:275
| > win_length:1102
Using model: Tacotron2
| > Num output units : 1025
Model restored from step 261000

Model has 28151842 parameters

DataLoader initialization
| > Data path: …/tts_models/my_data
| > Use phonemes: True
| > phoneme language: en-us
| > Cached dataset: False
| > Number of instances : 2395
| > Max length sequence: 234
| > Min length sequence: 17
| > Avg length sequence: 133.70313152400834
| > Num. instances discarded by max-min seq limits: 0
| > Batch group size: 256.
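
The derived STFT values at the end of that log follow directly from the audio config, which makes them a quick place to spot a config/dataset mismatch; the arithmetic is just:

sample_rate, frame_shift_ms, frame_length_ms, num_freq = 22050, 12.5, 50, 1025

hop_length = int(frame_shift_ms / 1000 * sample_rate)   # 275, matches the log
win_length = int(frame_length_ms / 1000 * sample_rate)  # 1102, matches the log
n_fft = (num_freq - 1) * 2                              # 2048, matches the log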

Trained again for ~10 hours using r = 1 and prenet_type "bn", and after a while the alignment is just gone again. At least the initial test audios have actual speech, although it only gets halfway through the sentence. Test audios seem to be better at the beginning of training (say, after 1500 steps or so) and then get progressively worse.

r = 2 and prenet_type "original" (both together and separately) gave me just static noise and no alignment at any epoch.

I’m at a loss, I guess I could try training Tacotron2 directly from NVIDIA’s repository.

I think that Tacotron-iter-260k is trained using an older version of TTS. It should be compatible with the new one, but it may not be (also, I do not know if it is Taco or Taco2 - in my case Taco2 works much better and aligns much quicker). Otherwise, it is very weird you are not getting results. I always restore a pretrained model at 40k steps when I want to train a new TTS and it never fails to align after 10K steps.
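
If it helps, restoring from a checkpoint in that repo is done on the command line; a hypothetical invocation would be along these lines (paths are placeholders, and on some branches the flag may be named differently):

python train.py --config_path config.json --restore_path checkpoints/checkpoint_40000.pth.tar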

You can try an older version, or try this one here: https://github.com/Edresson/TTS/tree/voice-cloning. Alternatively, you can try fine-tuning this model: https://github.com/mozilla/TTS/issues/345. The repo I am linking is the one I use, and it always works. I have not tried the new TTS version.

Thank you, I’ll try both and report back. Which model do you suggest I restore from when using Edresson’s repo? Or should I train from scratch?

I have no idea, but the one I linked above has worked for me in the past :smiley:

Back again with some results, I’ve tried both Edresson’s repo and the 670k model from the issue link, as well as just plain Tacotron2 from the NVIDIA repo:

  • Edresson’s repo didn’t really work, as I couldn’t figure out which pretrained model would be compatible; I tried training from scratch, but that gave no good results either
  • The 670k model gave some results (after a bunch of fixes to the code, e.g. the synthesis of test audio files was broken); however, the voice turns out very high-pitched and noisy
  • The Tacotron2 model from NVIDIA works perfectly, the voice is very good and clean

Considering that it clearly isn’t a dataset problem, how is it possible that Mozilla’s TTS gives such bad results when it’s using the same architecture (Tacotron2)?

The documentation isn’t really clear on what the best branch/commit/model combo is. Which one is the recommended setup for just fine-tuning an English voice on top of a pre-trained model? I feel this project has a million branches and a million more combinations, and none seem to work.

Have you checked any of the wiki entries? https://github.com/mozilla/TTS/wiki

Our released models are here from top to bottom older to newer models. https://github.com/mozilla/TTS/wiki/Released-Models

Are you sure your dataset specs match with your config?

There is also https://github.com/mozilla/TTS/wiki/FAQ to help.

I agree that things don’t look as tidy as they’re supposed to be. If you feel like there needs to be a change, feel free to send PRs and updates, or at least come up with concrete suggestions.

In the worst case, if the other repo works, you can always use that one. They exist for a reason.

Edit:

I just checked the config.json

try instead

"location_attn": true,
"prenet_type": "original",
"use_forward_attn": false

As soon as I figure out how to make it work I’ll make a tutorial with what works for me.

Right now I’ve tried the Tacotron2 DDC model and fine-tuned it on my dataset. It finally trains properly, but the results are kind of high-pitched/muffled after 8 hours; I’m not sure if it requires training for longer.

Here are some statistics:

and here are some alignment charts:

Here is my config.json

My dataset is modelled after LJSpeech, it has 2462 wav files between 6 and 12 seconds in length, the speech is clear with no noise.
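
For anyone unfamiliar with the layout: an LJSpeech-style dataset is a wavs/ folder plus a pipe-delimited metadata.csv with one line per clip. The lines below are invented examples:

# metadata.csv: file id | raw transcription | normalized transcription
seg_0001|Dr. Smith spoke for 10 minutes.|Doctor Smith spoke for ten minutes.
seg_0002|The speech then continued.|The speech then continued.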

Two samples to compare the outputs:

DDC will not work with "ddc_r": 1; set it to 7. And disable forward attention initially for the first run.
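
In config.json that would look roughly like:

"ddc_r": 7,               // reduction rate of the coarse decoder in DDC
"use_forward_attn": false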

Trained again with "ddc_r" set to 7 and forward attention disabled; it’s marginally better, although still nowhere near as good as the original Tacotron2.

I’ve trained for 8 hours using exactly these steps, of course with my own dataset instead of LJSpeech.

Some results:

In the meantime I have also fine-tuned WaveGlow for another language that I had trained with the Tacotron2 model from NVIDIA’s repo, and it worked perfectly. So again, I’m pretty sure I’m missing some parameter here somewhere.
:confused:

It sounds like the dataset has some background noise. Maybe you can try some preprocessing with audio software. Also, you could try and see if you can get good results training from scratch; 8 hours should definitely get you intelligible speech. Maybe when the frames drop to 2 it will be very erratic, but I remember training my first TTS on 5 hours of speech and it wasn’t clean either.
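
One concrete way to do that preprocessing, assuming SoX is installed: build a noise profile from a stretch of background-only audio, then apply mild noise reduction to each clip (the 0.21 strength and the file names are just starting points):

# Build a noise profile from a clip containing only background noise
sox noise_only.wav -n noiseprof noise.prof
# Apply the profile to each dataset clip
sox in.wav out.wav noisered noise.prof 0.21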

I’ve cleaned up the dataset as much as I can, trained both the TTS and vocoder on the new dataset and I still get the same problem.
I’m giving up on this for now, I’ve been battling with it for more than a month, I’ll just stick with NVIDIA’s repos which work fine. Thank you all for the help.