Data and training considerations to improve voice naturalness

The learning rate was what broke the model. I lowered it to 10^-4 and now the training is going better. Just FYI: you might want to keep this in mind if you’re fine-tuning.


I explained a method to ease training here in a hastily written post (in an airport) http://www.erogol.com/gradual-training-with-tacotron-for-faster-convergence/


I’m trying out gradual training and it’s very helpful. I’m in the middle of a series of runs to test the impact of other adjustments, but my key points from gradual training with my dataset are:

  • Big help in bringing down the training time needed to get good r=2 results
  • Once it jumps to r=1 I’ve run into problems with stopping (“Decoder stopped with ‘max_decoder_steps’”) - I’m seeing whether running it for a lot longer helps (I’ll give it another ~12 hours or so)
  • At r=1, when it’s actually producing speech, the output is much more lifelike (as expected), but until I resolve the stopping problems it’s not yet as usable as the r=2 models

Thanks for the guide @erogol


I’m short on time this evening so I’ll have to wait till Sunday for a fuller update, but I’ve managed to get reasonably good results with gradual training. Here’s a sample

However, so far I’ve only managed to get a stable model with r=2. Once the gradual training progresses to r=1, it ends up breaking up most of the time. With one of my runs I did get some snippets of speech that were remarkably realistic, but it couldn’t complete anything beyond a few words together.

I’ll give details of the configs on Sunday, but I’ve varied the max character length between ~160 and 190 characters, tried both true and false for trimming silence, and run for between 300k and 450k iterations on regular Tacotron with the exact gradual training settings @erogol gave in the article.


Could you post your TensorBoard snippet?

After some small changes, I’m also training a new Tacotron2 the gradual way, to be released soon. So we’ll see how it behaves.

Hi @erogol - is this suitable?

I can post more screenshots focusing on any particular charts that are of interest (or zooming in further). Here are the overall EvalStats and TrainEpochStats charts (for all four sets of runs together), along with the EvalFigures and TestFigures charts for the best run (in terms of audio quality for general usage).

All runs have:

  • “use_forward_attn”: false - as per this I train without it and then turn it on for inference (see the sketch just after this list); is that still a sensible approach?
  • “location_attn”: true - left this untouched
  • settings were tuned based on the CheckSpectrograms notebook
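
In case it helps anyone, one simple way to flip that flag only for inference is to write out a copy of the config with it enabled and point the synthesis step at that copy. A rough sketch, assuming your inference script accepts the same JSON config format (the output filename is just a placeholder):

import json

# Keep "use_forward_attn": false for training, then write a copy of the
# config with it enabled and use that copy for inference/synthesis only.
with open("config.json") as f:
    cfg = json.load(f)

cfg["use_forward_attn"] = True
with open("config_inference.json", "w") as f:
    json.dump(cfg, f, indent=4)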

1st run
Orange + Red (continuation/fine-tuning)
neil14_october_v1-October-04-2019_02+28AM-3abf3a4

  • when fine-tuning (i.e. continuing training) with the red line, I’d actually made a handful of minor corpus text corrections that were discovered after the initial run (orange)

  • “max_seq_len”: 200

  • “do_trim_silence”: true

  • “gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] - followed the gradual training values provided (see the sketch just below this run’s settings for how the entries are applied)

  • “memory_size”: -1 - I’d left this as the default based on TTS/config.json, but later adjusted it to 5 as I saw TTS/config_tacotron.json had it higher
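
For anyone else trying gradual training: as I understand it, each entry in that schedule is a [start_step, r, batch_size] triple, and whichever entry’s start_step was most recently passed is the one in force. A rough sketch of that logic (the helper name is mine, not something from the repo):

# Each schedule entry is [start_step, r, batch_size]; the last entry whose
# start_step has been reached determines the current r and batch size.
GRADUAL_TRAINING = [[0, 7, 32], [10000, 5, 32], [50000, 3, 32],
                    [130000, 2, 16], [290000, 1, 8]]

def current_r_and_batch_size(global_step, schedule=GRADUAL_TRAINING):
    _, r, batch_size = schedule[0]
    for start_step, new_r, new_batch_size in schedule:
        if global_step >= start_step:
            r, batch_size = new_r, new_batch_size
    return r, batch_size

print(current_r_and_batch_size(150000))   # -> (2, 16) with the schedule above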

2nd run
Cyan
neil14_october_v3-October-06-2019_11+49PM-3abf3a4

  • “max_seq_len”: 195

  • “do_trim_silence”: true

  • “gradual_training” values unchanged from above

3rd run
Pink
neil14_october_v4-October-10-2019_12+32AM-3abf3a4

  • “max_seq_len”: 164

  • “do_trim_silence”: true

  • “gradual_training” values unchanged from above

  • some phoneme corrections in ESpeak

4th run
Turquoise
neil14_october_v4-October-12-2019_12+16AM-3abf3a4

  • “max_seq_len”: 164

  • “do_trim_silence”: false

  • some additional phoneme corrections in ESpeak

  • tried a bigger batch size for the later gradual training stages (simply as it’d be faster, right? Seems to have been fine):
    “gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 32], [290000, 1, 16]]

Observations: the best audio output is actually from the 2nd run (Cyan). The best model from the 4th run looked better on paper (BEST MODEL (0.03737) vs BEST MODEL (0.08910)), but it was unusable during inference, as I never got any audio from it and it gives “Decoder stopped with ‘max_decoder_steps’” even on short phrases.
Also, none of them could produce consistent output once training transitioned to r=1; the best results were all from the r=2 stage.

I can also say r=2 is better for my models, but that’s with noisy datasets like LJSpeech. I guess having lots of silences in a dataset is also a problem; with a dataset professionally recorded specifically for TTS, there is no such problem. My guess is that when it goes from r=2 to r=1, the silences also get longer and it becomes hard for the attention to tell whether it has reached the end.

Another point is the length of the sequence: going from r=2 to r=1 makes the sequence twice as long for the decoder (e.g. a 400-frame spectrogram takes 200 decoder steps at r=2 but 400 at r=1). That might make it hard for the attention RNN to learn good representations.


@nmstoker I can also tell that gradual training comes loose at r=1 with LJSpeech, but I need to check with a better dataset to say anything for certain. However, Tacotron2 looks much more robust against this shift.


Do you have any recommendations for setting memory_size?

In the main branch it’s set to 5 in config_tacotron.json, but in config.json (which was also updated slightly more recently) it’s set to -1 (i.e. not active).

In most of the runs mentioned above I’d left it at -1, and in my 4th run (which had fairly bad results) I’d switched it to 5 (I should’ve mentioned this but overlooked it). As I’d varied some other settings on that worse run, I wondered whether the bad results were more related to those other settings than to memory_size, and whether I might be missing out by reverting to -1.

I’ve always trained my models with memory_size kept at 5, and I’ve had both good results and sub-par results (where the model works decently for maybe 50% of test sentences and produces noise for the rest). The key difference between these experiments was dataset quality: one had consistent volume and speaker characteristics, the other was not so consistent. I’m not sure what to conclude from this, just putting the info out here (which is why I switched to working on data normalization instead of hyperparameter tuning).


Thanks, I reckon I should switch to 5 then.

I agree that dataset quality is critical. I’d already weeded out a number of bad samples from mine along with some transcription errors.

Something I tried just recently that could be helpful for others is looking at clustering in my dataset’s audio samples using https://github.com/resemble-ai/Resemblyzer.

It creates embeddings for each voice sample; I then used UMAP, as per one of the Resemblyzer demos (t-SNE could also work), and finally plotted the results in Bokeh, along with a simple trick to make each plotted point a hyperlink to its audio file. That way I could target my focus (given I have nearly 20 hours of audio!)
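
The core of it is roughly this (a simplified sketch rather than the full notebook code; the wavs/ path is a placeholder):

from pathlib import Path

import numpy as np
import umap
from resemblyzer import VoiceEncoder, preprocess_wav

# Embed every sample with Resemblyzer's speaker encoder, then project the
# embeddings down to 2D with UMAP for plotting.
wav_fpaths = sorted(Path("wavs").glob("*.wav"))   # placeholder path to the dataset
encoder = VoiceEncoder()

wavs = [preprocess_wav(fpath) for fpath in wav_fpaths]
embeds = np.array([encoder.embed_utterance(wav) for wav in wavs])

projections = umap.UMAP().fit_transform(embeds)   # shape: (n_samples, 2)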

I’m away from my computer till this evening, but I’ll post the full notebook code in a gist.

YMMV, but for me it was reasonably helpful as a general guide on where to look. Two main clusters emerged: the larger one contained typically good quality audio, while the smaller one contained samples that tended to have a slightly raspier quality (and occasionally more serious sound problems). I’ve cut out the worst cases and am training on that now. Given time I’ll also explore removing the whole raspier cluster.


Thank you so much for pointing this out! I was training my own autoencoder; this will save me a lot of time. I really appreciate it. Hopefully this will help me reach some conclusive and stable training.


That’s really smart. I’ve also implemented the same paper as that repo, with multi-speaker training in mind. If I find some time I can release it under TTS.


Here’s the Jupyter Notebook for the Resemblyzer/Bokeh plot I mentioned above, in a gist: https://gist.github.com/nmstoker/0fe7b3e9beae6e608ee0626aef7f1800

You can ignore the sizeable warning that comes from the UMAP reducer. Depending on the number of samples and the computer you use, it can take a while to run (so it may be worth running through with a more limited number of .wav files initially, just to be sure everything works). It takes 40+ minutes on my laptop.

When it has produced the scatter plot, navigate to the location of your wav files, and in that location start a local server (with the same port as used in the notebook):

python3 -m http.server

and then you should be able to click on the points in the chart and it’ll open the corresponding audio in another tab. I’ve seen code that makes a Bokeh chart play the audio directly, but I haven’t tried that yet (and this basically works well enough).
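
The hyperlink trick itself is basically just a TapTool with an OpenURL callback pointing at that local server. A simplified sketch (not the exact notebook code), carrying on from the projections and wav_fpaths in the earlier sketch:

from bokeh.models import ColumnDataSource, OpenURL, TapTool
from bokeh.plotting import figure, show

# Scatter plot of the UMAP projections; clicking a point opens the matching
# wav file served by `python3 -m http.server` (port 8000 by default).
source = ColumnDataSource(data=dict(
    x=projections[:, 0],
    y=projections[:, 1],
    fname=[fpath.name for fpath in wav_fpaths],
))

p = figure(title="Dataset sample embeddings (UMAP)",
           tools="tap,pan,wheel_zoom,box_zoom,reset")
p.circle("x", "y", source=source, size=6)

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url="http://localhost:8000/@fname")
show(p)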

Here’s a screenshot of the scatter plot, with the two main clusters standing out quite clearly.


Would you be willing to adapt your notebook to https://github.com/mozilla/TTS/tree/dev/speaker_encoder? That’d be a great contribution! I already have a model trained on LibriSpeech with 900 speakers that I can share

Yes, I’d be keen to give that a shot. I’ll have to look over the code there in a bit more detail and I’ll probably have a few questions.

Feel free to ask questions as you like 🙂

Hey Neil, do you remember which part took longest to run? I’m trying to speed things up.

Roughly, it was looping over all the wav files with preprocess_wav(wav_fpath) that took the longest, but the next two steps also took a decent amount of time, though not quite as long I think. I’ll try to get some time to look at it this evening or tomorrow evening, so if I get updated timings I can share those.

from multiprocessing import Pool
from resemblyzer import preprocess_wav

# Change the worker count to match the cores available.
with Pool(32) as p:
    wavs = p.map(preprocess_wav, wav_fpaths)   # wav_fpaths: the list of wav paths from the notebook

Try this out. It should save you quite a bit of time.

P.S. You won’t see the tqdm progress bar though.

P.P.S.

This is pretty dope.
