My latest results: a Tacotron2 model trained on my private dataset, with a MelGAN vocoder

I’m near the end of my lunch break, so I need to be brief (I’ll fill in the skipped details this evening), but I wanted to share output I’ve produced with Tacotron2 trained on my private dataset, using the MelGAN vocoder.

I’m really pleased with the results - whilst I can still hear aspects I’d like to improve (tone and sometimes speed), I suspect they are largely down to my dataset.

For some reason the Soundcloud link isn’t working here if pasted alone, but I think this should work now: https://soundcloud.com/user-726556259/sets/tacotron2-plus-melgan-v1

The output was produced by slightly adapting server.py / synthesizer.py based on the CoLab here: https://github.com/mozilla/TTS/issues/345#issuecomment-586778834


Nice work! Which MelGAN implementation is that?

Could you also share some dataset examples to compare the results?

I’ll get back more fully over the weekend but I based my config on this one: https://github.com/erogol/ParallelWaveGAN/blob/tts/parallel_wavegan/melgan_v3.yaml

As I’m using a GTX 1080 Ti (i.e. 11 GB) I had previously run into memory issues (when I’d tried training with an earlier version of this repo), so I slightly reduced the value of batch_max_steps. In retrospect I realise I didn’t test whether those issues still came up - I just assumed they might.
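For anyone wanting to make the same change, it’s a single key in the vocoder config. This is only an illustrative fragment - the surrounding keys and the actual default in melgan_v3.yaml may differ:

```yaml
# Illustrative fragment only - check melgan_v3.yaml for the real defaults.
batch_size: 64
batch_max_steps: 8192   # reduced from the default to fit an 11 GB card
```

Smaller batch_max_steps means shorter audio crops per sample, which lowers memory use at the cost of the discriminator seeing less context per step.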

Happy to share some samples from the dataset (which I’ll also have to do over the weekend :slightly_smiling_face:) Are you after examples of the full-length recordings used to train TTS, or the samples that ParallelWaveGAN produces (i.e. the shorter snippets it includes alongside the generated audio output during training)?

I was more after the ground-truth audio samples from the dataset you used to train your models. Thanks for sharing all this, btw.

Hi @erogol

Here’s some more detail, including some audio samples used for training.

My audio dataset
The samples here are just picked randomly from the full set, which varies reasonably well in sentence length overall. They were recorded over quite some time, so in my wider set there are likely to be some with slightly inconsistent style, but I’ve generally tried to keep it neutral and steady, weeding out the most divergent / worst cases with the help of the speaker-embedding chart and a bit of brute effort :slightly_smiling_face:
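That embedding-chart weeding step can also be approximated programmatically. A minimal sketch, assuming you’ve already extracted one embedding vector per clip - the clip names, vectors, and threshold below are made up for illustration, and this is not the method the repo itself uses:

```python
import math

def flag_outliers(embeddings, threshold=2.0):
    """Flag clips whose embedding sits unusually far from the centroid.

    embeddings: dict mapping clip name -> embedding vector (list of floats).
    Returns clip names whose distance from the mean embedding exceeds
    `threshold` standard deviations above the mean distance.
    """
    names = list(embeddings)
    dim = len(embeddings[names[0]])
    # Centroid (mean) of all embedding vectors.
    centroid = [sum(embeddings[n][i] for n in names) / len(names)
                for i in range(dim)]
    # Euclidean distance of each clip's embedding from the centroid.
    dists = {n: math.dist(embeddings[n], centroid) for n in names}
    mean = sum(dists.values()) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists.values()) / len(dists))
    return [n for n, d in dists.items() if std and d > mean + threshold * std]
```

Clips flagged this way are candidates for a listen and possible removal, rather than automatic deletion.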

The dataset is getting on for 16+ hours now. I’m happy to share the handful of samples included here, but wish to keep my overall dataset private (as it’s my own voice).

Most recent training run
Having seen how good your results were with MelGAN here: https://github.com/mozilla/TTS/issues/345
I planned to follow the same basic approach as closely as I could, simply using my dataset instead:

  • Training a Tacotron2 model
    • First with Forward Attention enabled until ~400k steps (although it ended up being 525k in the end)
    • Then fine-tuning with BN (batch normalisation)
  • Then training a MelGAN model for the vocoder

One small difference is that I’m using espeak-ng with UK RP phonemes (given that’s closer to my accent than the default US English output). It’s a slightly customised version: I recompiled its dictionary to add some missing words that feature fairly commonly in my dataset (which the default espeak-ng UK RP setting was not converting accurately). I suspect the impact of the customisation on training is fairly small.

When I ran the initial Tacotron2 run (see 1_config.json in the zip) I let it go longer than planned and ended up using the best model from around 525k as the one I then fine-tuned with BN (2_config.json).

Training the vocoder was fairly straightforward, although initially I got a little mixed up about the metadata.txt file produced by bin/preprocess_tts.py. I may be wrong, but I think an earlier version used CSV-format train and validation metadata files, and I initially tried those, as I’d used them when I looked at this repo late last year. No big deal - the file format was rejected and I soon figured it out. It’ll teach me to read the code before jumping in!! :wink:

The vocoder ran for the full 400k steps. The config for that is melgan_v3_neil15.yaml in the attached zip file. I changed the value of batch_max_steps as mentioned above, but otherwise it was based on melgan_v3.yaml.

I did wonder whether I should train MelGAN even longer for better results - that’s just a matter of setting the resume argument, isn’t it?
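Resuming boils down to pointing the training script’s resume argument at the last saved checkpoint. A generic sketch of picking the newest checkpoint in a directory - the file naming pattern here is an assumption, so adjust it to whatever your trainer actually writes:

```python
import os
import re

def latest_checkpoint(ckpt_dir):
    """Return the path of the checkpoint with the highest step count.

    Assumes files named like 'checkpoint-400000steps.pkl' (an assumed
    naming scheme - adapt the regex to your trainer's output).
    """
    pattern = re.compile(r"checkpoint-(\d+)steps\.pkl$")
    best, best_steps = None, -1
    for name in os.listdir(ckpt_dir):
        m = pattern.match(name)
        if m and int(m.group(1)) > best_steps:
            best, best_steps = name, int(m.group(1))
    return os.path.join(ckpt_dir, best) if best else None
```

You’d then pass the returned path (or just the known checkpoint file) as the resume argument when relaunching training.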

Hope that helps - fire away with any other questions etc. I’ll be back online again this evening.

Original_audio_samples_and_config_files.zip (468.4 KB)

Am now trying a longer run with MelGAN. The resume parameter seems to be working fine, although it did look stuck at first (it just took a long time with no obvious change after printing the run parameters, but then the progress bar reappeared at 400k, counting up to my new 800k target).

Thx for sharing all these.

Regarding MelGAN, yes, training more always helps. Even after 2M iterations, you still see improvements.

Good to know! Thanks.

@nmstoker Hey nmstoker, I have tried training ParallelWaveGAN on an Indonesian dataset. The sample below doesn’t sound quite clear. Have you ever encountered a similar situation? Thanks a lot. :pray:

Indonesian_wavernn_sample.zip (861.4 KB)