Data and training considerations to improve voice naturalness

nmstoker · September 19, 2019, 12:52pm

I’m keen to discuss what people have been considering in regard to data and training approaches to improve voice quality (naturalness of audio) and overall capabilities.

I’ve read the wiki Dataset page and played around with the notebooks, and they were helpful. I also realise a big improvement comes from increasing the size of my dataset (it got radically better between 6 hrs and when I got it to 12-13 hrs) and am pushing on to increase that further, but I also wanted to think about ways I could best direct my efforts.

The phoneme coverage, as mentioned on the wiki, seems critical, so I’ve started gathering stats to show how well (or poorly!) my dataset represents general English speech. I’m also looking at how well the Espeak backend converts the words in my dataset to phonemes (since if it produces phonemes that are either wrong or markedly different from the pronunciation in my dataset, it’ll undermine the model’s ability to learn well).
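As a rough illustration of the kind of coverage stats described above, here is a minimal sketch. It assumes the transcripts have already been converted to space-separated phoneme strings by whatever backend is in use; the function name and the `reference_inventory` set are hypothetical and are not part of TTS or Espeak.

```python
from collections import Counter


def phoneme_coverage(phonemized_utterances, reference_inventory):
    """Count phoneme occurrences and list inventory phonemes never seen.

    phonemized_utterances: iterable of strings with space-separated phonemes
    reference_inventory: set of phonemes expected in general speech (assumed)
    """
    counts = Counter()
    for utterance in phonemized_utterances:
        counts.update(utterance.split())
    total = sum(counts.values())
    # Relative frequency of each phoneme across the whole dataset
    freqs = {p: n / total for p, n in counts.items()} if total else {}
    # Phonemes in the reference inventory that the dataset never contains
    missing = sorted(set(reference_inventory) - set(counts))
    return counts, freqs, missing


if __name__ == "__main__":
    # Toy data for illustration only
    utterances = ["h ə l oʊ", "w ɜː l d", "h ɛ l p"]
    inventory = {"h", "ə", "l", "oʊ", "w", "ɜː", "d", "ɛ", "p", "ʒ"}
    counts, freqs, missing = phoneme_coverage(utterances, inventory)
    print(counts.most_common(3))
    print("missing:", missing)
```

Comparing `freqs` against frequencies from a large general-English corpus would show which phonemes are under-represented, not just which are absent.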

One area I’m particularly keen to hear the thoughts of others on is whether there’s any advantage to the following:

Initially training with a much simpler subset of my data
Then fine-tuning with a broader set

or

Whether it’s best just to start with everything from the start.

My (naïve) intuition here is that babies start with simple words and build up. I could probably limit the training sentences to those under a certain character length, or better still to single short words (although my dataset is probably a little skewed there, as I’ve not really got that many single-word sentences). Has anyone tried something similar or seen any commentary on this kind of thing elsewhere?

alchemi5t · September 19, 2019, 2:48pm

Hey NMS,

I’ve had a hard time getting a decent model with r=1 (batchsize=96), so I initially trained with a 20% subset of my data (~4 hrs). It aligned much quicker and gave me better results than r>1, but it is still slightly unnatural. Now that I have this model, I am going to bootstrap from it and train on my entire dataset (batchsize=32).

Hoping for a better WaveRNN model with this taco2. My previous WaveRNN was on taco2 (r=2), which was pretty good but had a shaky, weak voice and whistling every now and then. I need to figure out how to build a consistent model.

Will try LPCNet next.
@carlfm01 Any insights on which neural vocoder architectures were most reliable?

carlfm01 · September 19, 2019, 8:17pm

Why 96? Well that explains your OOM with short sentences.

Basically just WaveRNN and WaveNet in terms of quality. There was a lot of effort to adapt the WORLD vocoder to Tacotron, along with other attempts, that didn’t end well.

github.com/Rayhane-mamah/Tacotron-2

Tacoton-2 plus World vocoder

opened 03:25AM - 28 Dec 18 UTC

closed 07:59AM - 16 May 19 UTC

begeekmyfriend

Hey I am glad to inform you that I have succeeded to merge Tacotron model with World vocoder and generated some evaluation results as follows. The results sound not bad but still not perfect. However it shows another way to train different feature parameters with Tacotron. The World vocoder is an open source project and thus everyone can use it for all. Moreover the quality of resynth results from that vocoder is better than that from Griffin-Lim since the three features (lf0[1], mgc[60] and ap[5]) contain not only magnitude spectrograms but also phase information. Furthermore the depth of the features is low enough that we do not need postnet for Tacotron model. The performance of training can be reduced to 0.7 second per step. The inference can also be quick enough even it only works on CPU. So it really worthes trying.

I would like to share my experimental source code with you as follows. Note that it currently only for Chinese mandarin. You may modify it for other languages:
[tacotron-world-vocoder branch](https://github.com/begeekmyfriend/tacotron/tree/mandarin-world-vocoder)
[Python-Wrapper-for-World-Vocoder](https://github.com/begeekmyfriend/Python-Wrapper-for-World-Vocoder)
[pysptk merlin-world-vocoder branch](https://github.com/begeekmyfriend/pysptk/tree/merlin-world-vocoder)

By the way you need use `python setup.py install` and the copy the so file manually into the system path for `pysptk` and python wrapper project. Besides I also would like to provide two Python scripts for World vocoder resynth test.
[world_vocoder_resynth_scripts.zip](https://github.com/Rayhane-mamah/Tacotron-2/files/2982117/world_vocoder_resynth_scripts.zip)

@Rayhane-mamah Let us rock with it! And @r9y9 thanks for your `pysptk` project.
[world_vocoder_demo.zip](https://github.com/Rayhane-mamah/Tacotron-2/files/2713888/world_vocoder_demo.zip)
![image](https://user-images.githubusercontent.com/6031938/50501258-ca2d1a80-0a91-11e9-8937-d607bd96f52e.png)

And now (for my use case) the most reliable is becoming LPCNet, which in the end is an adaptation of WaveRNN. The issue with LPCNet is that the users who shared good quality speech didn’t share many details of the versions they used or the adaptations they made.

This is a good summary about TTS:
http://www.erogol.com/text-speech-deep-learning-architectures/

To avoid my pain, here are a few steps:
First, read the paper: https://jmvalin.ca/papers/lpcnet_icassp2019.pdf
Read this issue: How to perform text to speech · Issue #4 · xiph/LPCNet · GitHub
Use this fork of LPCNet. As mentioned in the issue, we don’t need to predict 50d but 20d, and this fork is able to extract only the 20d needed to train tacotron2:
GitHub - MlWoo/LPCNet: Efficient neural speech synthesis
Read the readme and the commit history carefully:
GitHub - MlWoo/LPCNet: Efficient neural speech synthesis
For tacotron2, use my fork with the spanish branch:
GitHub - carlfm01/Tacotron-2: DeepMind's Tacotron-2 Tensorflow implementation
You will need to change the symbols and paths.

carlfm01 · September 19, 2019, 9:48pm

Back to the question of the post: @nmstoker, your suggestion is called “curriculum learning”. To read more about it, see https://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf
Erogol did something similar but for the decoder steps, lowering r on a schedule.

nmstoker · September 19, 2019, 9:51pm

Thanks @alchemi5t
When you cut to 20% did you just take an arbitrary sample or was there any particular process to select the items to include/exclude?

nmstoker · September 19, 2019, 10:11pm

Thanks also @carlfm01 - I’ve skimmed that paper and will read it properly in the morning.

alchemi5t · September 20, 2019, 3:34am

Confession time: I did not know the batch size in the config was not the effective batch size. So when I picked 32 and 3 GPUs, I unintentionally trained a model on batch size 96, which started giving me the expected results.
But the thing is, when I wanted to train on my entire dataset with batch size 32, it generated nothing but static again. I’ll train it for another 3-4 days and see what the deal is, unless you have any other suggestions.
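The batch size arithmetic behind that surprise can be sketched as follows. This assumes, as reported above, that the config value is applied per GPU; the function names are hypothetical and not taken from the TTS code.

```python
def effective_batch_size(per_gpu_batch_size, num_gpus):
    """Effective batch size when every GPU processes its own batch each step."""
    if per_gpu_batch_size < 1 or num_gpus < 1:
        raise ValueError("both arguments must be positive")
    return per_gpu_batch_size * num_gpus


def per_gpu_batch_for(target_effective, num_gpus):
    """Config value needed to approximate a target effective batch size.

    Rounds down, so the result may land slightly under the target.
    """
    if target_effective < 1 or num_gpus < 1:
        raise ValueError("both arguments must be positive")
    return max(1, target_effective // num_gpus)


if __name__ == "__main__":
    # 32 in the config on 3 GPUs, as in the post above
    print(effective_batch_size(32, 3))
    # Config value that would keep the effective size near 32 on 3 GPUs
    print(per_gpu_batch_for(32, 3))
```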

The 20% cut was actually to try to fit the batch size in memory, not to build a weak model which I’d later use to build a stronger one. Coincidentally, that weak model was the best model I’ve trained on r=1, and it was only later that I decided to use it as a bootstrap.

For the 20% cut, the only heuristic I used was to take only sentences with length < 50. My dataset has sentences with lengths up to 469, and only 20% of my data made this cut.
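A minimal sketch of that length heuristic, assuming a simple two-column `clip_id|text` metadata file; the file layout and function names are assumptions for illustration, not the script actually used here.

```python
def load_metadata(path, sep="|"):
    """Read 'clip_id|text' lines (an assumed two-column layout)."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            clip_id, text = line.split(sep, 1)
            items.append((clip_id, text))
    return items


def split_by_length(items, max_chars=50):
    """Split (clip_id, text) pairs into short (< max_chars) and the rest."""
    short = [(cid, text) for cid, text in items if len(text) < max_chars]
    rest = [(cid, text) for cid, text in items if len(text) >= max_chars]
    return short, rest


if __name__ == "__main__":
    # Toy items for illustration only
    items = [("a1", "Hello there."), ("a2", "x" * 120), ("a3", "Short one.")]
    short, rest = split_by_length(items, max_chars=50)
    print(f"{len(short)}/{len(items)} sentences kept for the first stage")
```

The `short` list could serve as the first training stage, with the full list used afterwards for fine-tuning.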

As soon as i am done with building a decent taco2, I’ll follow this!

carlfm01 · September 20, 2019, 4:30am

I’m afraid I don’t have any other recommendation.

alchemi5t · September 20, 2019, 5:19pm

The learning rate was what broke the model. I lowered it to 10^-4 and now the training is going better. Just FYI; you might want to keep this in mind in case you’re fine-tuning.

erogol · September 21, 2019, 10:49pm

I explained a method to ease training here in a hastily written post (in an airport) http://www.erogol.com/gradual-training-with-tacotron-for-faster-convergence/

nmstoker · October 9, 2019, 10:03am

I’m trying out gradual training and it’s very helpful - I am in the middle of a series of runs now to test impact of other adjustments but my key points from gradual training with my dataset are:

Big help bringing down training time to get good r=2 results
Once it jumps to r=1 I’ve run into problems with stopping (Decoder stopped with ‘max_decoder_steps’) - I’m just seeing if running it for a lot longer helps (I’ll give it another ~12 hours or so)
On r=1, when it’s actually producing speech, what’s produced is much more lifelike (as expected), but until I resolve the stopping problems it’s not yet as usable as the r=2 models

Thanks for the guide @erogol

nmstoker · October 11, 2019, 10:39pm

I’m short on time this evening so I’ll have to wait till Sunday for a fuller update, but I’ve managed to get reasonably good results with gradual training. Here’s a sample

However so far I’ve only ever managed to get a stable model with r=2 - once the gradual training progresses to r=1 it ends up breaking up most of the time. With one of my runs I did get some snippets of speech and they were remarkably realistic, but it couldn’t complete anything beyond a few words together.

I’ll give details of the config on Sunday, but they’ve varied the max character length between ~160 and 190 characters, both true and false for trimming silence, running for between 300k and 450k iterations on regular Tacotron with the exact settings @erogol gave for gradual training in the article.

erogol · October 12, 2019, 4:57pm

Could you post your TensorBoard snippet?

After some small changes, I’m also training a new Tacotron2 model the gradual way, and it will be released soon. So we’ll see how it behaves.

nmstoker · October 16, 2019, 12:13pm

Hi @erogol - is this suitable?

I can post more screenshots focusing on any particular ones that are of interest (or zooming in more). Here are the overall EvalStats and TrainEpochStats charts (for all four sets of runs together) along with the EvalFigures and TestFigures charts for the best run (in terms of audio quality for general usage)

All runs have:

“use_forward_attn”: false - as per this I train w/o it then turn on for inference; is that still sensible approach?
“location_attn”: true - left this untouched
had tuned based on CheckSpectrograms notebook

1st run
Orange + Red (continuation/fine-tuning)
neil14_october_v1-October-04-2019_02+28AM-3abf3a4

when fine-tuning (ie continuing training) with red line, I’d actually made a handful of minor corpus text corrections that were discovered after initial run (orange);
“max_seq_len”: 200
“do_trim_silence”: true
“gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] - followed the gradual training values provided
“memory_size”: -1 - had left this as default based on TTS/config.json, but later adjusted to 5 as saw TTS/config_tacotron.json had it higher

2nd run
Cyan
neil14_october_v3-October-06-2019_11+49PM-3abf3a4

“max_seq_len”: 195
“do_trim_silence”: true
“gradual_training” values unchanged from above

3rd run
Pink
neil14_october_v4-October-10-2019_12+32AM-3abf3a4

“max_seq_len”: 164
“do_trim_silence”: true
“gradual_training” values unchanged from above
some phoneme corrections in ESpeak

4th run
Turquoise
neil14_october_v4-October-12-2019_12+16AM-3abf3a4

“max_seq_len”: 164
“do_trim_silence”: false
some additional phoneme corrections in ESpeak
tried a bigger batch size for the later gradual training stages (simply as it’d be faster, right?; seems to have been fine)
“gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 32], [290000, 1, 16]]
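To make the two schedules above easier to compare, here is a small sketch that looks up which settings are active at a given step. It reads each entry as `[start_step, r, batch_size]`, which is inferred from the descriptions in this thread; the lookup function is hypothetical and is not the TTS implementation.

```python
def schedule_at(step, schedule):
    """Return (r, batch_size) active at a given global step.

    schedule: list of [start_step, r, batch_size], sorted by start_step.
    """
    if not schedule or step < schedule[0][0]:
        raise ValueError("step precedes the first schedule entry")
    active = schedule[0]
    for entry in schedule:
        if step >= entry[0]:
            active = entry  # this stage has started
        else:
            break  # later stages have not started yet
    return active[1], active[2]


# Schedules copied from the 1st and 4th runs above
RUN_1 = [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]]
RUN_4 = [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 32], [290000, 1, 16]]

if __name__ == "__main__":
    for step in (0, 9999, 10000, 200000, 300000):
        print(step, schedule_at(step, RUN_1), schedule_at(step, RUN_4))
```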

Observations: The best audio output is actually from the 2nd run (Cyan); the best model from the 4th run seemed better (BEST MODEL (0.03737) vs BEST MODEL (0.08910)) but it was unusable during inference, as I never got any audio from it and it gives “Decoder stopped with ‘max_decoder_steps’” even on short phrases.
Also, none of them could produce consistent output once training transitioned to r=1. The best results were all in the r=2 stage.

erogol · October 17, 2019, 11:37pm

I can also say r=2 is better for my models, but with noisy datasets like LJSpeech. I guess having lots of silences in a dataset is also a problem. With a dataset professionally recorded especially for TTS, there is no such problem. I guess that when it goes from r=2 to 1, the silences also get longer and it becomes hard for the attention to tell whether it has reached the end.

Another point is the length of the sequence. Going from r=2 to 1 makes the sequence 2 times longer for the decoder. That might make it hard for the attention RNN to learn good representations.
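The effect of r on sequence length can be shown with a little arithmetic. This sketch assumes the decoder emits r mel frames per iteration and that `max_decoder_steps` counts decoder iterations; both the helper names and the example numbers are hypothetical.

```python
import math


def decoder_steps(num_mel_frames, r):
    """Decoder iterations needed when each iteration emits r frames."""
    if r < 1:
        raise ValueError("r must be >= 1")
    return math.ceil(num_mel_frames / r)


def hits_step_limit(num_mel_frames, r, max_decoder_steps):
    """True if the utterance cannot finish within the step limit."""
    return decoder_steps(num_mel_frames, r) > max_decoder_steps


if __name__ == "__main__":
    frames = 800  # hypothetical utterance length in mel frames
    for r in (7, 5, 3, 2, 1):
        print(f"r={r}: {decoder_steps(frames, r)} decoder steps")
```

Under these assumptions, an utterance that fits comfortably at r=2 needs twice as many iterations at r=1, so the attention has to stay aligned over a sequence twice as long before the stop condition is reached.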

erogol · October 22, 2019, 5:54pm

@nmstoker I can also tell that gradual training is loose when r=1 with LJSpeech. But I need to check with a better dataset to say something certain. However, tacotron2 looks much more robust against this shift.

nmstoker · October 30, 2019, 10:31pm

Do you have any recommendations for setting memory_size?

In the main branch in config_tacotron.json it’s set at 5 but in config.json (which is also updated slightly more recently) it’s set at -1 (ie not active)

In most of my runs mentioned above I’d left it at -1 and in my 4th run (which had fairly bad results) I’d switched it to 5 (I should’ve mentioned this but overlooked it). As I had varied some other settings on that worse run, I wondered if the bad results were more related to the other settings than memory_size and I might be missing out by reverting to using -1.

alchemi5t · October 31, 2019, 4:15am

I’ve always trained all my models with memory_size kept at 5, and I’ve had both good results and sub-par results (where the model works decently for maybe 50% of test sentences and produces noise for the rest). The key difference between these experiments was the dataset quality. One had consistent volume and speaker characteristics; the other was not so consistent. I am not sure what to conclude from this, just putting this info out here (which is why I switched to working on data normalization instead of hyperparameter tuning).

nmstoker · October 31, 2019, 1:22pm

Thanks, I reckon I should switch to 5 then.

I agree that dataset quality is critical. I’d already weeded out a number of bad samples from mine along with some transcription errors.

Something I tried just recently that could be helpful for others is looking at clustering in my dataset’s audio samples using https://github.com/resemble-ai/Resemblyzer .

It creates embeddings for each voice sample, then I used UMAP as per one of the Resemblyzer demos (T-SNE could also work) and finally plotted the results in Bokeh along with a simple trick to make each plotted point a hyperlink to the audio file - that way I could target my focus (given I have nearly 20 hrs of audio!)
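The workflow described above could look roughly like the sketch below. This is not the gist mentioned in this post, just an illustration assuming `resemblyzer`, `umap-learn`, `numpy` and `bokeh` are installed; the helper names are hypothetical, and the click-to-open behaviour relies on Bokeh's `OpenURL` callback with local `file://` links, which some browsers may restrict.

```python
from pathlib import Path


def build_records(wav_paths, coords):
    """Pair each 2-D point with a file:// URL so a click can open the audio."""
    records = {"x": [], "y": [], "name": [], "url": []}
    for path, (x, y) in zip(wav_paths, coords):
        resolved = Path(path).resolve()  # as_uri() needs an absolute path
        records["x"].append(float(x))
        records["y"].append(float(y))
        records["name"].append(resolved.name)
        records["url"].append(resolved.as_uri())
    return records


def plot_clusters(wav_paths, html_out="clusters.html"):
    """Embed each clip, project to 2-D with UMAP and plot clickable points."""
    # Third-party imports are kept local so the helper above stays stdlib-only
    import numpy as np
    import umap
    from bokeh.models import ColumnDataSource, OpenURL, TapTool
    from bokeh.plotting import figure, output_file, show
    from resemblyzer import VoiceEncoder, preprocess_wav

    encoder = VoiceEncoder()
    embeds = np.array(
        [encoder.embed_utterance(preprocess_wav(Path(p))) for p in wav_paths]
    )
    # UMAP needs a reasonable number of samples to give a meaningful layout
    coords = umap.UMAP().fit_transform(embeds)

    source = ColumnDataSource(build_records(wav_paths, coords))
    fig = figure(tools="tap,pan,wheel_zoom,reset", title="Utterance embeddings")
    fig.scatter("x", "y", size=6, source=source)
    taptool = fig.select(type=TapTool)
    taptool.callback = OpenURL(url="@url")  # open the clicked point's audio file
    output_file(html_out)
    show(fig)
```

Calling `plot_clusters(sorted(Path("wavs").glob("*.wav")))` (with `wavs` standing in for the dataset folder) would write an HTML page in which clicking a point opens the corresponding clip.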

Am away from my computer till this evening, but I’ll post the basic code on a gist.

YMMV, but for me it was reasonably helpful as a general guide on where to look. Two main clusters emerged, with the largest for typically good quality audio and the smaller of the two containing samples that tended to have a slightly more raspy quality (and occasionally more major sound problems). I’ve cut out the worst cases and am training with that now. Given time I’ll also explore removing that whole more raspy cluster.

alchemi5t · October 31, 2019, 1:39pm

Thank you so much for pointing this out! I was training my own auto encoder; this will save me a lot of time. I really appreciate it. Hopefully this will help me reach some conclusive and stable training.
