Latest TTS

Hi all,
I was working with a TTS version I cloned about a year ago and was very impressed by the quality out of the box. I was interested in testing the latest version with multi-speaker support and, after running some control experiments on LJSpeech 1.1, the final validation samples after 1k epochs sound much worse. I’ve tried matching the few config settings that differ (e.g. learning rate), but I am still getting pretty poor quality speech on all of LJSpeech over 1000 epochs.

I am trying to play with the attention settings, but has anyone gone through this? I don’t want to try LibriTTS until I am sure that I’ve got things working sensibly.

I should have been more verbose in my description, but I think that the gradual training was really degrading quality on LJSpeech 1.1. Setting:
"gradual_training": null

seemed to do the trick (not 100% sure, there are a few other parameters that I changed).

can you post tensorboard outputs with and without gradual training?

Apologies for the delay in responding. I wanted to do a few more runs to try and establish things, but I actually don’t think I have a good handle on what is going on. I successfully got the ~1-year-old code base producing pretty reasonable quality speech on about 3 hours of training data augmented with phonemes (my own hacky code path, not the way phonemes are currently implemented). I am struggling to get baseline performance with the current release on LJ Speech 1.1.

I tried simplifying the config as best I can and I still wind up with unintelligible speech after 1000 epochs. In some configurations the network produces good output midway through training (similar to what I got on the much older release), but I can see the same sort of loss spikes. I’ve attached my config (3.2 KB) and a screenshot of the loss.

I must be doing something pretty stupid, can anyone spot it?

Just to check I’ve understood you correctly, you are struggling to get good results after 1000 epochs of the current release, ie the TTS repo as it is in the master branch (or dev?) as it is now? You mention some separate code with phonemes, but the way I’m reading what you wrote it seems that’s not part of the code you’re having trouble with is it?

It’s hard to know, but I suspect you’ll struggle because you’ve switched off the gradual training. You indicated that you did this because you weren’t getting good results, but what you’ve ended up with is doing all the training at r=5, which sounds to me like it would be limiting. Generally the quality of speech improves as you move to lower values of r (provided that the model manages the transition without things getting messed up).

How about reinstating the gradual training in your config? I’d be inclined to put it back in as per the config.json values. However, I suppose you could experiment with not including all the levels, e.g. just go down to r=3 and see if that’s any better, then add back the additional levels and retrain from a checkpoint a little before the point where it switches down to the next level.
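To make the "only go down to r=3" idea concrete, here is a minimal sketch of how a step-based schedule could be looked up during training. It assumes each entry has the layout `[start_step, r, batch_size]`; that layout and the numbers below are illustrative assumptions, so check them against the `gradual_training` entry in your own config.json. The helper name `pick_stage` is hypothetical and not part of the TTS repo.

```python
def pick_stage(schedule, global_step):
    """Return (r, batch_size) of the last stage whose start step has been reached."""
    current = schedule[0]
    for stage in schedule:
        if global_step >= stage[0]:
            current = stage
    return current[1], current[2]


# Illustrative schedule, assumed layout: [start_step, r, batch_size]
full_schedule = [
    [0, 7, 64],
    [1, 5, 64],
    [50000, 3, 32],
    [130000, 2, 32],
    [290000, 1, 32],
]

# Truncated variant that never drops below r=3
down_to_r3 = [stage for stage in full_schedule if stage[1] >= 3]

for step in (0, 60000, 300000):
    print(step, pick_stage(full_schedule, step), pick_stage(down_to_r3, step))
```

With the truncated list the model stays at r=3 for the rest of the run, which makes it easier to tell whether a later transition is what triggers the loss spikes.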

Is there anything else that might have an impact you haven’t mentioned, such as not using all the LJSpeech dataset, limitations on your GPU memory, any errors in your logs, older/newer versions of Python modules installed than the recommended ones or any other unusual setup?

Oh, you’ve got this in your config:

"use_phonemes": false

That’ll make it struggle. Any reason you didn’t want that on?

And I suppose you could try turning on use_forward_attn.

And your tensorboard logs only show up to 100k but you previously talked about going up to 1000 - do you have the longer set?

‘Just to check I’ve understood you correctly, you are struggling to get good results after 1000 epochs of the current release, ie the TTS repo as it is in the master branch (or dev?) as it is now?’

Correct, I’m struggling to get good results with the current master branch on LJ Speech 1.1. The bit about phonemes can be ignored; I only mentioned it to confirm that things had been working nicely with this codebase without much fuss in my hands, but something funny is happening now that I can’t quite isolate.

My initial TTS experiments were on dual 1080 Tis; my new experiments are on quad RTX 2080 Tis with newer Python/torch (python=3.6.9, torch=1.3.0). I don’t think there is anything unusual going on in terms of platform.

I’ll put gradual back in and play with r to see how I do.

I am playing with phonemes and I wanted to do some control testing; that’s why I have it off (I do have runs with it on and working, but I still have this issue of loss spikes).

The old codebase (with no configurable attention or phonemes) did fine before, so I’m turning it off as a part of isolating what is going wrong. I think the horizontal axis is ok? ~90 steps per epoch * 1000 epochs?
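As a quick sanity check on the horizontal axis, the arithmetic can be sketched as below. The step counts come from the figures quoted in this thread; the helper name is only illustrative.

```python
def total_global_steps(steps_per_epoch, epochs):
    """Total optimizer steps, assuming every epoch has the same number of steps."""
    return steps_per_epoch * epochs


# ~90 steps per epoch over 1000 epochs
print(total_global_steps(90, 1000))
```

That gives 90,000 steps, which is consistent with a TensorBoard axis that ends a little below 100k. The estimate only holds while the batch size is constant; if the batch size changes mid-run, the number of steps per epoch changes with it.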

For the sake of completeness, here are some of the graphs I saw when training on a 3.3h dataset (with custom phonemization) with the ~1-year-old code base (before configurable attention settings and built-in phonemization support). These look healthier.

config snippet was pretty simple:
"num_freq": 1025,
"sample_rate": 22000,
"frame_length_ms": 50,
"frame_shift_ms": 10.0,
"preemphasis": 0.97,
"min_level_db": -100,
"ref_level_db": 20,
"embedding_size": 256,
"text_cleaner": "null_cleaners",

"num_loader_workers": 4,

"epochs": 1000,
"lr": 0.002,
"warmup_steps": 4000,
"lr_decay": 0.5,
"decay_step": 100000,
"batch_size": 32,
"r": 5,
"wd": 0.0001,

"griffin_lim_iters": 60,
"power": 1.5,
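For anyone comparing audio settings between configs, it can help to convert the millisecond values into the sample counts the STFT actually sees. The sketch below uses the conventional relations (FFT size from the number of frequency bins, window and hop from milliseconds times sample rate). That the codebase derives them exactly this way is an assumption, and `stft_params` is a hypothetical helper name.

```python
def stft_params(num_freq, sample_rate, frame_length_ms, frame_shift_ms):
    """Derive (n_fft, win_length, hop_length) in samples from config-style values."""
    n_fft = (num_freq - 1) * 2  # a one-sided spectrum has n_fft / 2 + 1 bins
    win_length = int(round(frame_length_ms / 1000.0 * sample_rate))
    hop_length = int(round(frame_shift_ms / 1000.0 * sample_rate))
    return n_fft, win_length, hop_length


# Values from the config snippet above
n_fft, win_length, hop_length = stft_params(1025, 22000, 50, 10.0)
print(n_fft, win_length, hop_length)
print("frames per second:", 22000 / hop_length)
```

For the values above this gives an FFT size of 2048, a 1100-sample window and a 220-sample hop, i.e. 100 spectrogram frames per second of audio.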

If anyone does have a ‘bare bones’ config (with the least reliance on any fancy new features) that they know works fine on LJ Speech with the current master, I’d love to see it.

Still getting these loss spikes across configurations.

I used graves attention model config from:

(attached mine just in case) (3.2 KB)

I pulled the most recent commit (9b97430a74fd5b43b5e0c0b11fddfeb38e60bd92) and am now getting an odd NaN issue.

I’m going to sanity check whatever I can (data etc.), but if anyone can see the issue, I’d love to know.

| > TotalLoss: 6.85920 PostnetLoss: 0.12684 - 0.12684 DecoderLoss:0.21654 - 0.21654 StopLoss: 6.45301 - 6.45301 AlignScore: 0.4279 : 0.4279
warning: audio amplitude out of range, auto clipped.
| > Synthesizing test sentences
| > Training Loss: 0.15940 Validation Loss: 0.12559
Number of outputs per iteration: 3

Epoch 358/1000
| > Step:16/92 GlobalStep:21075 PostnetLoss:0.32882 DecoderLoss:0.45036 StopLoss:0.17568 AlignScore:0.1434 GradNorm:4.90582 GradNormST:2.12395 AvgTextLen:59.9 AvgSpecLen:336.6 StepTime:1.03 LoaderTime:0.02 LR:0.000100
| > Step:41/92 GlobalStep:21100 PostnetLoss:0.28766 DecoderLoss:0.40615 StopLoss:3.42795 AlignScore:0.1384 GradNorm:46714.50217 GradNormST:13.98048 AvgTextLen:93.1 AvgSpecLen:517.1 StepTime:1.37 LoaderTime:0.02 LR:0.000100
| > Step:66/92 GlobalStep:21125 PostnetLoss:0.42214 DecoderLoss:0.55323 StopLoss:0.23890 AlignScore:0.0948 GradNorm:354.74020 GradNormST:2.37551 AvgTextLen:116.8 AvgSpecLen:651.6 StepTime:1.67 LoaderTime:0.01 LR:0.000100
| > Gradient is INF !!
| > Gradient is INF !!
| > Gradient is INF !!
| > Gradient is INF !!
| > Step:91/92 GlobalStep:21150 PostnetLoss:nan DecoderLoss:nan StopLoss:nan AlignScore:nan GradNorm:nan GradNormST:nan AvgTextLen:149.2 AvgSpecLen:776.4 StepTime:0.81 LoaderTime:0.01 LR:0.000100
[WARNING] NaN or Inf found in input tensor.
[WARNING] NaN or Inf found in input tensor.
[WARNING] NaN or Inf found in input tensor.
[WARNING] NaN or Inf found in input tensor.
| > EPOCH END – GlobalStep:21151 AvgPostnetLoss:nan AvgDecoderLoss:nan AvgStopLoss:nan AvgAlignScore:nan EpochTime:120.15 AvgStepTime:1.29 AvgLoaderTime:0.03
[WARNING] NaN or Inf found in input tensor.
[WARNING] NaN or Inf found in input tensor.
[WARNING] NaN or Inf found in input tensor.
[WARNING] NaN or Inf found in input tensor.

| > TotalLoss: nan PostnetLoss: nan - nan DecoderLoss:nan - nan StopLoss: nan - nan AlignScore: nan : nan
! Run is kept in /nobackup/myadav1/TTS/output/mky_ljspeech_graves2-bn-March-06-2020_08+21PM-9b97430
Traceback (most recent call last):
  File "", line 715, in
  File "", line 634, in main
  File "", line 450, in evaluate
    eval_audio = ap.inv_mel_spectrogram(const_spec.T)
  File "/home/users/myadav/venvs/sri_tts/TTS/utils/", line 174, in inv_mel_spectrogram
    return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
  File "/home/users/myadav/venvs/sri_tts/TTS/utils/", line 190, in _griffin_lim
    angles = np.exp(1j * np.angle(self._stft(y)))
  File "/home/users/myadav/venvs/sri_tts/TTS/utils/", line 199, in _stft
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/librosa/core/", line 215, in stft
  File "/home/users/myadav/.virtualenvs/sri_tts/lib/python3.6/site-packages/librosa/util/", line 275, in valid_audio
    raise ParameterError('Audio buffer is not finite everywhere')
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere
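To pin down where a run like this starts to go wrong, one option is to scan the console log for steps whose gradient norm blows up or turns into NaN. The sketch below is a standalone helper, not part of the TTS repo. It assumes the `GlobalStep:` and `GradNorm:` fields look like the lines above, and the threshold of 100 is an arbitrary illustrative value.

```python
import math
import re

# Matches the first "GradNorm:" after "GlobalStep:" (not the later "GradNormST:")
STEP_RE = re.compile(r"GlobalStep:(\d+).*?GradNorm:(\S+)")


def find_unstable_steps(log_lines, threshold=100.0):
    """Return (global_step, grad_norm) pairs with a non-finite or very large GradNorm."""
    flagged = []
    for line in log_lines:
        match = STEP_RE.search(line)
        if match is None:
            continue  # not a per-step training line
        step = int(match.group(1))
        grad_norm = float(match.group(2))  # "nan" parses to float NaN
        if not math.isfinite(grad_norm) or grad_norm > threshold:
            flagged.append((step, grad_norm))
    return flagged


sample_log = [
    "| > Step:16/92 GlobalStep:21075 StopLoss:0.17568 GradNorm:4.90582 GradNormST:2.12395",
    "| > Step:41/92 GlobalStep:21100 StopLoss:3.42795 GradNorm:46714.50217 GradNormST:13.98048",
    "| > Step:66/92 GlobalStep:21125 StopLoss:0.23890 GradNorm:354.74020 GradNormST:2.37551",
    "| > Gradient is INF !!",
    "| > Step:91/92 GlobalStep:21150 StopLoss:nan GradNorm:nan GradNormST:nan",
]

for step, grad_norm in find_unstable_steps(sample_log):
    print(step, grad_norm)
```

On the excerpt above this flags global steps 21100, 21125 and 21150, so the gradient norm was already spiking about 50 steps before the losses became NaN. The batches at those steps also have the longest average text and spectrogram lengths in the excerpt, which may be worth checking.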

Ok, switching to dev and starting with the template config looks like typical healthy training. Still not quite sure how the stuff above is happening in master, but it seems to not be there in dev.