Data and training considerations to improve voice naturalness

Hi @erogol - is this suitable?

I can post more screenshots focusing on any particular ones that are of interest (or zooming in further). Here are the overall EvalStats and TrainEpochStats charts (for all four sets of runs together), along with the EvalFigures and TestFigures charts for the best run (in terms of audio quality for general usage).

All runs have:

  • “use_forward_attn”: false - as per this, I train without it and then turn it on for inference; is that still a sensible approach? (see the sketch just after this list)
  • “location_attn”: true - left this untouched
  • audio settings were tuned based on the CheckSpectrograms notebook
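
For reference, this is roughly how those shared flags sit (an illustrative fragment only - the real config.json has many more fields; flipping use_forward_attn at inference is simply how I’ve been doing it):

import json

# illustrative fragment only - not the full mozilla/TTS config.json
shared_settings = {
    "use_forward_attn": False,  # kept off while training
    "location_attn": True,      # left at its default
}

# at inference time I flip forward attention back on before synthesis
inference_settings = dict(shared_settings, use_forward_attn=True)
print(json.dumps(inference_settings, indent=2))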

1st run
Orange + Red (continuation/fine-tuning)
neil14_october_v1-October-04-2019_02+28AM-3abf3a4

  • when fine-tuning (ie continuing training) with the red line, I’d actually made a handful of minor corpus text corrections that were discovered after the initial run (orange);

  • “max_seq_len”: 200

  • “do_trim_silence”: true

  • “gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] - followed the suggested gradual training values (see the sketch after this list for how I read these triplets)

  • “memory_size”: -1 - had left this as the default based on TTS/config.json, but later adjusted it to 5 as I saw TTS/config_tacotron.json had it higher
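
In case it helps anyone else, here’s a minimal sketch of how I read those gradual_training triplets - assuming each entry is [start_step, r, batch_size] and the latest entry whose start_step has been reached is the one in force:

# minimal sketch, assuming each entry is [start_step, r, batch_size]
def gradual_schedule(global_step, schedule):
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, new_r, new_batch in schedule:
        if global_step >= start_step:
            r, batch_size = new_r, new_batch
    return r, batch_size

schedule = [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]]
print(gradual_schedule(0, schedule))       # (7, 32) - coarse decoding, bigger batches
print(gradual_schedule(150000, schedule))  # (2, 16)
print(gradual_schedule(300000, schedule))  # (1, 8)  - full resolution, smaller batches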

2nd run
Cyan
neil14_october_v3-October-06-2019_11+49PM-3abf3a4

  • “max_seq_len”: 195

  • “do_trim_silence”: true

  • “gradual_training” values unchanged from above

3rd run
Pink
neil14_october_v4-October-10-2019_12+32AM-3abf3a4

  • “max_seq_len”: 164

  • “do_trim_silence”: true

  • “gradual_training” values unchanged from above

  • some phoneme corrections in ESpeak

4th run
Turquoise
neil14_october_v4-October-12-2019_12+16AM-3abf3a4

  • “max_seq_len”: 164

  • “do_trim_silence”: false

  • some additional phoneme corrections in ESpeak

  • tried a bigger batch size for the later gradual training stages (simply as it’d be faster, right? It seems to have been fine)
    “gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 32], [290000, 1, 16]]

Observations: The best audio output is actually from the 2nd run (Cyan); the best model from the 4th run seemed better on paper (BEST MODEL (0.03737) vs BEST MODEL (0.08910)), but it was unusable during inference: I never got any audio from it, and it gives “Decoder stopped with 'max_decoder_steps'” even on short phrases.
Also, none of them could produce consistent output once training transitioned to r=1. The best results were all from the r=2 stage.

I can also say r=2 is better for my models, but that’s with noisy datasets like LJSpeech. I guess having lots of silences in a dataset is also a problem. With a dataset professionally recorded specifically for TTS, there is no such problem. I guess when it goes from r=2 to 1, the silences also elongate, and it gets hard for the attention to tell whether it has reached the end.

Another point is the length of the sequence. Going from r=2 to 1 makes a sequence 2 times longer for the decoder. It might make things hard for the attention RNN to learn good representations.
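
Just to make that concrete with rough numbers (assuming 22050 Hz audio and a hop length of 256, which are only typical values):

# rough illustration only - 22050 Hz and hop_length 256 are just typical values
sample_rate = 22050
hop_length = 256
clip_seconds = 5

n_frames = clip_seconds * sample_rate // hop_length  # ~430 mel frames

for r in (7, 5, 3, 2, 1):
    print(f"r={r}: ~{n_frames // r} decoder steps")
# going from r=2 (~215 steps) to r=1 (~430 steps) doubles what the
# attention has to track for the same clip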

@nmstoker I can also tell that gradual training gets loose when r=1 with LJSpeech. But I need to check with a better dataset before saying anything certain. However, Tacotron2 looks much more robust against this shift.

Do you have any recommendations for setting memory_size?

In the main branch, in config_tacotron.json it’s set at 5, but in config.json (which was also updated slightly more recently) it’s set at -1 (ie not active).

In most of my runs mentioned above I’d left it at -1, and in my 4th run (which had fairly bad results) I’d switched it to 5 (I should’ve mentioned this above but overlooked it). As I had varied some other settings on that worse run, I wondered whether the bad results were more related to those other settings than to memory_size, and whether I might be missing out by reverting to -1.

I’ve always trained my models with memory_size kept at 5, and I’ve had good results and I’ve had sub-par results (where the model works decently for maybe 50% of test sentences and for the rest produces noise). The key difference between these experiments was dataset quality. One had consistent volume and speaker characteristics, the other was not so consistent. I am not sure what to conclude from this, just putting the info out here (which is why I switched to working on data normalization instead of hyperparameter tuning).
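
For what it’s worth, the normalization I mean is nothing fancy - roughly along these lines (a rough sketch; the folder names and the -25 dBFS target are just placeholders for whatever suits your corpus):

import numpy as np
import soundfile as sf
from pathlib import Path

TARGET_DBFS = -25.0  # placeholder target level

def rms_dbfs(wav):
    rms = np.sqrt(np.mean(wav ** 2))
    return 20 * np.log10(max(rms, 1e-10))

out_dir = Path("wavs_norm")
out_dir.mkdir(exist_ok=True)
for path in sorted(Path("wavs").glob("*.wav")):
    wav, sr = sf.read(path)
    gain = 10 ** ((TARGET_DBFS - rms_dbfs(wav)) / 20)
    wav = np.clip(wav * gain, -1.0, 1.0)  # avoid clipping after the gain
    sf.write(out_dir / path.name, wav, sr)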

Thanks, I reckon I should switch to 5 then.

I agree that dataset quality is critical. I’d already weeded out a number of bad samples from mine along with some transcription errors.

Something I tried just recently that could be helpful for others is looking at clustering in my dataset’s audio samples using https://github.com/resemble-ai/Resemblyzer .

It creates embeddings for each voice sample; then I used UMAP as per one of the Resemblyzer demos (t-SNE could also work) and finally plotted the results in Bokeh, with a simple trick to make each plotted point a hyperlink to the audio file - that way I could target my focus (given I have nearly 20 hrs of audio!)

Am away from my computer till this evening, but I’ll post the basic code on a gist.

YMMV, but for me it was reasonably helpful as a general guide on where to look. Two main clusters emerged, with the largest for typically good quality audio and the smaller of the two containing samples that tended to have a slightly more raspy quality (and occasionally more major sound problems). I’ve cut out the worst cases and am training with that now. Given time I’ll also explore removing that whole more raspy cluster.

Thank you so much for pointing this out! I was training my own autoencoder; this will save me a lot of time. I really appreciate it. Hopefully this will help me reach some conclusive and stable training.

It is really smart. I’ve also implemented the same paper as that repo, with multi-speaker training in mind. If I find some time I can release it under TTS.

Here’s the Jupyter Notebook for the Resemblyzer/Bokeh plot I mentioned above, in a gist: https://gist.github.com/nmstoker/0fe7b3e9beae6e608ee0626aef7f1800

You can ignore the substantial warning that comes from the UMAP reducer. Depending on the number of samples and the computer you use, it can take a while to run (so it may be worth running through with a more limited number of .wav files initially, just to be sure everything works). It takes about 40+ minutes on my laptop.

When it has produced the scatter plot, navigate to the location of your wav files, and in that location start a local server (with the same port as used in the notebook):

python3 -m http.server

and then you should be able to click on the points in the chart and it’ll open them in another tab. I’ve seen people with code to make a Bokeh chart play the audio directly, but I haven’t tried that yet (and this basically works well enough).
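
For anyone who doesn’t want to dig through the gist, the core of it is roughly this (a stripped-down sketch, assuming resemblyzer, umap-learn and bokeh are installed; the paths and port are placeholders):

from pathlib import Path

import umap
from bokeh.models import ColumnDataSource, HoverTool, OpenURL, TapTool
from bokeh.plotting import figure, output_file, show
from resemblyzer import VoiceEncoder, preprocess_wav

wav_fpaths = sorted(Path("my_dataset/wavs").glob("*.wav"))  # placeholder path

# embed each sample with Resemblyzer (this is the slow part)
encoder = VoiceEncoder()
embeds = [encoder.embed_utterance(preprocess_wav(fp)) for fp in wav_fpaths]

# reduce the 256-dim embeddings to 2D for plotting
projection = umap.UMAP().fit_transform(embeds)

source = ColumnDataSource(data=dict(
    x=projection[:, 0],
    y=projection[:, 1],
    fname=[fp.name for fp in wav_fpaths],
    # served by "python3 -m http.server" run from the wav folder
    url=[f"http://localhost:8000/{fp.name}" for fp in wav_fpaths],
))

p = figure(title="Resemblyzer embeddings (UMAP)", tools="tap,pan,wheel_zoom,reset")
p.circle("x", "y", source=source, size=6)
p.add_tools(HoverTool(tooltips=[("file", "@fname")]))
p.select(type=TapTool).callback = OpenURL(url="@url")

output_file("embedding_clusters.html")
show(p)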

Here’s a screenshot of the scatter plot, with the two main clusters standing out quite clearly.

Would you be willing to adapt your notebook to https://github.com/mozilla/TTS/tree/dev/speaker_encoder? That’d be a great contribution! I already have a model trained on LibriSpeech with 900 speakers that I can share

Yes, I’d be keen to give that a shot. I’ll have to look over the code there in a bit more detail and I’ll probably have a few questions.

Feel free to ask questions as you like :slight_smile:

Hey Neil, do you remember which part took the longest to run? I am trying to speed things up.

Roughly, it was looping over all the wav files with preprocess_wav(wav_fpath) that took the most time; the next two steps also took a decent amount of time, but not quite as long, I think. I’ll be trying to get some time to look at it this evening or tomorrow evening, so if I get updated timings I can share those.

from multiprocessing import Pool
from resemblyzer import preprocess_wav  # as used earlier in the notebook

with Pool(32) as p:  # change the number according to the cores available
    wavs = p.map(preprocess_wav, wav_fpaths)  # wav_fpaths built earlier

Try this out. It should save you quite a bit of time.

P.S. You won’t be seeing the tqdm progress bar though.

P.P.S.

This is pretty dope.

The model link popped up a request for access in Google Drive, so I submitted that a little earlier.

I plan to get the notebook working with your pre-trained model first (basically exactly as you have it, against the LibriTTS data which I’ve got downloaded (somewhere!) on my desktop), then I’ll see about using it with my own dataset to try to cluster it, similarly to what I did in my gist using Bokeh.

should be fine now…

Thx! I’ll be waiting for it.

I’ve made a pull request with the interactive plotting changes I got working earlier

It can handle either a single speaker or multiple speakers. I slightly adjusted compute_embeddings.py so it could also pull in my corpus (basically the same format as LJSpeech, although I’ve only two columns in the csv, which it can now optionally read).
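
The metadata handling is nothing clever - a rough sketch of the idea (not the actual compute_embeddings.py code) looks like this:

import csv
from pathlib import Path

def load_metadata(csv_path, wav_dir):
    # rough sketch: pipe-delimited metadata with either two columns
    # (filename|text) or the LJSpeech-style three (filename|text|normalised text)
    items = []
    with open(csv_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if not row:
                continue
            wav_file = Path(wav_dir) / (row[0] + ".wav")
            text = row[2] if len(row) > 2 else row[1]
            items.append((wav_file, text))
    return items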

The speaker embedding model needs to be downloaded as per the speaker_encoder README; then install Bokeh, produce the embeddings, update the file locations in the notebook, and you should be good to go.

There’s a quick screencast of it here too (sorry - the voice-over is a little too quiet and then the audio samples played are much louder!! Take care while listening :wink: Also, the cursor position seems off in the recording vs its actual location).

Good video and PR! You can also add the video link to the encoder README.md if you like, in your PR.

Sure, am away from my computer right now but I’ll do that this evening

Thanks again for this project - am really pleased I could make a small contribution by way of thanks for the huge efforts and skill you’ve put into it :slightly_smiling_face: