Data and training considerations to improve voice naturalness

I’m keen to discuss what people have been considering in regard to data and training approaches to improve voice quality (naturalness of audio) and overall capabilities.

I’ve read the wiki Dataset page and played around with the notebooks, and they were helpful. I also realise a big improvement comes from increasing the size of my dataset (it got radically better between 6 hrs and 12-13 hrs) and am pushing on to increase that further, but I also wanted to think about how best to direct my efforts.

The phoneme coverage mentioned on the wiki seems critical, so I’ve started gathering stats to show how well (or poorly!) my dataset represents general English speech. I’m also looking at how well the Espeak backend converts the words in my dataset to phonemes, since any words it gets wrong, or renders markedly differently from how they are pronounced in my dataset, will undermine the model’s ability to learn well.
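
A minimal sketch of the kind of coverage stats described above, assuming the transcripts have already been converted to phoneme strings and, as a simplification, that each character is one phoneme symbol. The helper name `phoneme_coverage`, the tiny reference inventory and the sample strings are all made up for illustration and do not come from the TTS repo or the wiki.

```python
from collections import Counter


def phoneme_coverage(phonemized_lines, reference_inventory, ignore=" ˈˌː"):
    """Return per-symbol relative frequencies and the reference symbols never seen."""
    counts = Counter()
    for line in phonemized_lines:
        # Skip spaces, stress marks and length marks (an assumption of this sketch)
        counts.update(ch for ch in line if ch not in ignore)
    total = sum(counts.values())
    freqs = {sym: n / total for sym, n in counts.items()} if total else {}
    missing = sorted(set(reference_inventory) - set(counts))
    return freqs, missing


# Hypothetical phonemized sentences
lines = ["həloʊ wɜːld", "ðɪs ɪz ə tɛst"]
# Hypothetical, deliberately tiny reference inventory
inventory = set("həloʊwɜdðɪszətɛʒŋ")

freqs, missing = phoneme_coverage(lines, inventory)
print("symbols missing from the dataset:", missing)
for sym, freq in sorted(freqs.items(), key=lambda kv: -kv[1])[:5]:
    print(sym, round(freq, 3))
```

Comparing `freqs` against frequencies from a large general English corpus would show which phonemes are under-represented, rather than just absent.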

One area I’m particularly keen to hear others’ thoughts on is whether there’s any advantage to the following:

  1. Initially training with a much simpler subset of my data
  2. Then fine-tuning with a broader set

Or whether it’s best just to train on everything from the start.

My (naïve) intuition here is that babies start with simple words and build up. I could probably limit training sentences to those under a certain number of characters or, better still, to single short words (although my dataset is probably a little skewed there, as I don’t have many single-word sentences). Has anyone tried something similar or seen any commentary on this kind of thing elsewhere?
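
One way the "simple first, broader later" idea could be set up is as a series of cumulative subsets keyed on character length. This is only a sketch: `curriculum_stages`, the thresholds and the sample entries are hypothetical, and whether it actually helps is exactly the open question here.

```python
def curriculum_stages(items, max_lengths):
    """Return one cumulative training subset per character-length limit.

    items: list of (utterance_id, text) pairs.
    max_lengths: increasing length limits, one per training stage.
    """
    stages = []
    for limit in max_lengths:
        stages.append([(uid, text) for uid, text in items if len(text) <= limit])
    return stages


# Hypothetical corpus entries
items = [
    ("utt1", "Hello."),
    ("utt2", "Good morning to you."),
    ("utt3", "This is a considerably longer sentence used for the final stage."),
]

limits = (10, 30, 200)
stages = curriculum_stages(items, limits)
for limit, subset in zip(limits, stages):
    print(f"max {limit} chars: {len(subset)} utterances")
```

Each stage would be trained in turn, restoring the checkpoint from the previous one.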

Hey NMS,

I’ve had a hard time getting a decent model with r=1 (batchsize=96), so I initially trained with a 20% subset of my data (~4 hrs). It aligned much quicker and gave me better results than r>1, but it is still slightly unnatural. Now that I have this model, I am going to bootstrap from it and train on my entire dataset (batchsize=32).

Hoping for a better WaveRNN model with this taco2. My previous WaveRNN was on taco2 (r=2), which was pretty good but had a shaky, weak voice and whistling every now and then. I need to figure out how to build a consistent model.

Will try LPCNet next.
@carlfm01 Any insights on what architectures for the neural vocoders were most reliable?


Why 96? Well that explains your OOM with short sentences.

Basically just WaveRNN and WaveNet in terms of quality; there was a lot of effort to adapt the WORLD vocoder and Tacotron, and others, that didn’t end well.

And now (for my use case) the most reliable is becoming LPCNet, which in the end is an adaptation of WaveRNN. The issue with LPCNet is that the users who shared good quality speech didn’t share many details of the versions they used or the adaptations they made.

This is a good summary about TTS:

To avoid my pain, a few steps:
First read the paper:
Read this issue:
Use this fork of LPCNet; as mentioned in the issue, we don’t need to predict 50d but 20d, and this fork is able to extract only 20d to train tacotron2:
Read the readme and the commit history carefully.
And for tacotron2 use my fork with the spanish branch:
You need to change the symbols and paths.


Back to the question of the post: @nmstoker, your suggestion is called “curriculum learning”, to read more about it
Erogol did something similar but for the decoder steps, lowering r in a scheduled way.


Thanks @alchemi5t
When you cut to 20% did you just take an arbitrary sample or was there any particular process to select the items to include/exclude?

Thanks also @carlfm01 - I’ve skimmed that paper and will read it properly in the morning.

Confession time: I did not know the batch size in the config was not the effective batch size. So when I picked 32 and 3 GPUs, I unintentionally trained a model with batch size 96, which started giving me the expected results.
But when I tried to train on my entire dataset with batch size 32, it generated nothing but static again. I’ll train it for another 3-4 days and see what the deal is, unless you have any other suggestions.
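
The arithmetic behind that surprise, as a small sketch. It assumes, as described in this post, that each GPU processes its own batch of the configured size; the function names are hypothetical.

```python
def effective_batch_size(per_gpu_batch_size, num_gpus):
    """Samples contributing to one optimizer step across all GPUs."""
    return per_gpu_batch_size * num_gpus


def per_gpu_batch_size_for(target_effective, num_gpus):
    """Config value needed to reach a target effective batch size."""
    if target_effective % num_gpus:
        raise ValueError("target is not divisible by the number of GPUs")
    return target_effective // num_gpus


print(effective_batch_size(32, 3))    # config value 32 on 3 GPUs
print(per_gpu_batch_size_for(96, 3))  # config value for an effective 96
```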

The 20% cut was actually an attempt to fit the batch in memory, not to build a weak model which I’d later use to build a stronger one. Coincidentally, that weak model was the best model I’ve trained on r=1, and it was only later that I decided to use it as a bootstrap.

For the 20% cut, the only heuristic I used was to take only sentences with length<50. My dataset has sentences with length up to 469, and only 20% of my data made this cut.

As soon as I am done building a decent taco2, I’ll follow this!

I’m afraid I don’t have any other recommendation :confused:

The learning rate was what broke the model. I lowered it to 10^-4 and now the training is going better. Just FYI; you might want to keep this in mind in case you’re fine-tuning.


I explained a method to ease training here in a hastily written post (in an airport)


I’m trying out gradual training and it’s very helpful. I am in the middle of a series of runs to test the impact of other adjustments, but my key points from gradual training with my dataset are:

  • Big help bringing down training time to get good r=2 results
  • Once it jumps to r=1 I’ve run into problems with stopping (Decoder stopped with ‘max_decoder_steps’) - I’m just seeing if running it for a lot longer helps (I’ll give it another ~12 hours or so)
  • On r=1, when it’s actually producing speech, what’s produced is much more lifelike (as expected), but until I resolve the stopping problems it’s not yet as usable as the r=2 models

Thanks for the guide @erogol


I’m short on time this evening so I’ll have to wait till Sunday for a fuller update, but I’ve managed to get reasonably good results with gradual training. Here’s a sample

However so far I’ve only ever managed to get a stable model with r=2 - once the gradual training progresses to r=1 it ends up breaking up most of the time. With one of my runs I did get some snippets of speech and they were remarkably realistic, but it couldn’t complete anything beyond a few words together.

I’ll give details of the config on Sunday, but the runs have varied the max character length between ~160 and 190 characters, with silence trimming both on and off, running for between 300k and 450k iterations on regular Tacotron with the exact settings @erogol gave for gradual training in the article.


Could you post your TensorBoard snippet?

After small changes I’m also training a new Tacotron2 the gradual way, and it will be released soon. So we’ll see how it behaves.

Hi @erogol - is this suitable?

I can post more screenshots focusing on any particular ones that are of interest (or zooming in more). Here are the overall EvalStats and TrainEpochStats charts (for all four sets of runs together) along with the EvalFigures and TestFigures charts for the best run (in terms of audio quality for general usage).

All runs have:

  • “use_forward_attn”: false - as per this I train w/o it then turn it on for inference; is that still a sensible approach?
  • “location_attn”: true - left this untouched
  • had tuned based on the CheckSpectrograms notebook

1st run
Orange + Red (continuation/fine-tuning)

  • when fine-tuning (i.e. continuing training) with the red line, I’d made a handful of minor corpus text corrections that were discovered after the initial run (orange);

  • “max_seq_len”: 200

  • “do_trim_silence”: true

  • “gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] - followed the gradual training values provided

  • “memory_size”: -1 - had left this as default based on TTS/config.json, but later adjusted it to 5 as I saw TTS/config_tacotron.json had it higher

2nd run

  • “max_seq_len”: 195

  • “do_trim_silence”: true

  • “gradual_training” values unchanged from above

3rd run

  • “max_seq_len”: 164

  • “do_trim_silence”: true

  • “gradual_training” values unchanged from above

  • some phoneme corrections in ESpeak

4th run

  • “max_seq_len”: 164

  • “do_trim_silence”: false

  • some additional phoneme corrections in ESpeak

  • tried a bigger batch size for the later gradual training stages (simply as it’d be faster, right? It seems to have been fine)
    “gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 32], [290000, 1, 16]]
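
The “gradual_training” lists in the runs above can be read as [start_step, r, batch_size] entries. A rough sketch of that lookup, assuming this reading of the format; `schedule_at` is a hypothetical helper written for illustration, not the function used in the TTS repo.

```python
def schedule_at(step, gradual_training):
    """Return (r, batch_size) from the last entry whose start step is <= step."""
    current = gradual_training[0]
    for entry in gradual_training:
        if step >= entry[0]:
            current = entry
    return current[1], current[2]


# Schedule used for runs 1-3 in this post
schedule = [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]]

for step in (0, 60000, 130000, 300000):
    r, batch_size = schedule_at(step, schedule)
    print(f"step {step}: r={r}, batch_size={batch_size}")
```

With this reading, the 4th run differs from the others only from step 130000 onwards, where its batch sizes are doubled.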

Observations: The best audio output is actually from the 2nd run (cyan). The best model from the 4th run seemed better (BEST MODEL (0.03737) vs BEST MODEL (0.08910)), but it was unusable during inference: I never got any audio from it, and it gives “Decoder stopped with ‘max_decoder_steps’” even on short phrases.
Also, none of them could produce consistent output once training transitioned to r=1. The best results were all in the r=2 stage.

I can also say r=2 is better for my models, but with noisy datasets like LJSpeech. I guess having lots of silences in a dataset is also a problem. With a dataset professionally recorded specifically for TTS, there is no such problem. I guess that when it goes from r=2 to 1, the silences also get longer and it becomes hard for the attention to tell whether it has reached the end.

Another point is the length of the sequence. Going from r=2 to 1 makes the sequence 2 times longer for the decoder. It might make it hard for the attention RNN to learn good representations.
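
To put numbers on the sequence-length point: with reduction factor r the decoder emits r frames per step, so the number of decoder steps is roughly the frame count divided by r. A small sketch, with a made-up utterance length and a hypothetical helper name:

```python
import math


def decoder_steps(n_frames, r):
    """Decoder steps needed to emit n_frames when each step outputs r frames."""
    return math.ceil(n_frames / r)


n_frames = 800  # hypothetical utterance length in mel frames
for r in (7, 5, 3, 2, 1):
    print(f"r={r}: {decoder_steps(n_frames, r)} decoder steps")
```

The same ratio applies to any trailing silence, which would also take twice as many steps at r=1 as at r=2.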
