My latest results using a Tacotron2 model trained on a private dataset, with a MelGAN vocoder

I’m near the end of my lunch break, so I need to be brief (I’ll fill in the skipped details this evening), but I wanted to share some output I’ve produced with Tacotron2 trained on my private dataset, using the MelGAN vocoder.

I’m really pleased with the results - whilst I can still hear aspects I’d like to improve (tone and sometimes speed), I suspect they are largely down to my dataset.

For some reason the Soundcloud link isn’t working here if pasted alone, but I think this should work now: https://soundcloud.com/user-726556259/sets/tacotron2-plus-melgan-v1

The output was produced by slightly adapting server.py / synthesizer.py based on the Colab here: https://github.com/mozilla/TTS/issues/345#issuecomment-586778834
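In case it helps, the adapted flow is roughly the outline below. The two loader functions are just stubs standing in for the model-loading code in that Colab - the function names, checkpoint filenames and the 22050 Hz rate are purely illustrative, not the repo’s actual API.

```python
# Rough outline of the synthesis path: text -> mel spectrogram (Tacotron2) -> waveform (MelGAN).
# The loaders are stubs - replace their bodies with the loading code from the linked Colab.
import soundfile as sf

def load_tacotron2(checkpoint_path, config_path):
    raise NotImplementedError("replace with the Tacotron2 loading code from the Colab")

def load_melgan(checkpoint_path, config_path):
    raise NotImplementedError("replace with the MelGAN loading code from the Colab")

def tts_to_wav(text, tacotron2, melgan):
    mel = tacotron2(text)   # text -> mel spectrogram
    return melgan(mel)      # mel spectrogram -> waveform

if __name__ == "__main__":
    taco = load_tacotron2("tacotron2_checkpoint.pth.tar", "2_config.json")  # illustrative names
    voc = load_melgan("melgan_checkpoint.pkl", "melgan_v3_neil15.yaml")
    wav = tts_to_wav("This is a test sentence.", taco, voc)
    sf.write("out.wav", wav, 22050)  # assuming the usual 22050 Hz sample rate
```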


Nice work! Which MelGAN implementation is that?

Could you also share some dataset examples to compare the results?

I’ll get back more fully over the weekend but I based my config on this one: https://github.com/erogol/ParallelWaveGAN/blob/tts/parallel_wavegan/melgan_v3.yaml

As I’m using a 1080 Ti (i.e. 11GB) I’d previously had memory issues (when I tried training with an earlier version of this repo), so I slightly reduced the value for batch_max_steps. In retrospect I realise I didn’t test whether those issues still came up - I just assumed they might.
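For context on why I touched that value: as I understand the ParallelWaveGAN configs, batch_max_steps is the number of audio samples per training segment (and needs to divide evenly by hop_size), so reducing it shrinks how much waveform and how many mel frames each batch item carries. A rough back-of-envelope with illustrative numbers:

```python
# Rough illustration of how batch_max_steps affects per-batch size
# (my understanding of the ParallelWaveGAN config; numbers here are illustrative, not defaults).
hop_size = 256
batch_size = 8

for batch_max_steps in (25600, 20480, 16384):
    assert batch_max_steps % hop_size == 0, "must be divisible by hop_size"
    mel_frames = batch_max_steps // hop_size
    print(f"batch_max_steps={batch_max_steps}: "
          f"{batch_max_steps * batch_size} waveform samples and "
          f"{mel_frames * batch_size} mel frames per batch")
```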

Happy to share some samples from the dataset (which I’ll also have to do over the weekend :slightly_smiling_face:). Are you after examples of the full-length recordings used to train TTS, or the samples that ParallelWaveGAN produces (i.e. those shorter snippets it includes along with the generated audio output during training)?

I was more after the ground-truth audio samples from the dataset you used to train your models. Thx for sharing all this btw.

Hi @erogol

Here’s some more detail, including some audio samples used for training.

My audio dataset
The samples here are just picked randomly from the full set, which overall varies reasonably well in sentence length. They were recorded over quite some time, so unfortunately the wider set is likely to contain some recordings with slightly inconsistent style, but I’ve generally tried to keep it neutral and steady, weeding out the most divergent / worst cases using the Speaker Embedding chart and a bit of brute effort :slightly_smiling_face:
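In case anyone wants to do something similar, the weeding-out was essentially just flagging clips whose speaker embedding sits a long way from the dataset average - roughly along these lines (assuming one embedding per clip has already been extracted and saved as a .npy file; the paths and the 2-standard-deviation threshold are just a starting point, not a recipe):

```python
# Flag clips whose speaker embedding is far from the dataset's mean embedding.
# Assumes one embedding per clip already saved as clip_name.npy; paths/threshold are illustrative.
from pathlib import Path
import numpy as np

embedding_files = sorted(Path("embeddings").glob("*.npy"))
embeddings = np.stack([np.load(f) for f in embedding_files])

mean_emb = embeddings.mean(axis=0)
# Cosine distance of each clip's embedding from the mean embedding
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(mean_emb)
cos_dist = 1.0 - (embeddings @ mean_emb) / norms

threshold = cos_dist.mean() + 2 * cos_dist.std()
for f, d in zip(embedding_files, cos_dist):
    if d > threshold:
        print(f"possible outlier: {f.stem} (distance {d:.3f})")
```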

The dataset is getting on for 16+ hours now. I’m happy to share the handful of samples included here, but wish to keep my overall dataset private (as it’s my own voice).

Most recent training run
Having seen how good your results were with MelGAN here: https://github.com/mozilla/TTS/issues/345
I planned to follow the same basic approach as closely as I could, simply using my dataset instead:

  • Training a Tacotron2 model
    • First with Forward Attention enabled until ~400k (although it ended up being 525k in the end)
    • Then fine-tuning with BN
  • Then training a MelGAN model for the vocoder (using the ParallelWaveGAN repo linked above)

One small difference is that I’m using espeak-ng with UK RP phonemes (given that’s closer to my accent than the default US English output). It’s also a slightly customised version, as I recompiled its dictionary to add some missing words which feature fairly commonly in my dataset (and which the default espeak-ng UK RP setting was not converting accurately). I suspect the impact of the customisation on training is fairly small.
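If anyone wants to check what espeak-ng produces for particular words, it’s easy to script - something like the below (en-gb-x-rp is, I believe, the voice name for the RP variant, but confirm with espeak-ng --voices=en on your install):

```python
# Quick check of the phonemes espeak-ng produces for a few words.
# "en-gb-x-rp" is my understanding of the RP voice name - confirm with `espeak-ng --voices=en`.
import subprocess

def rp_phonemes(text, voice="en-gb-x-rp"):
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for word in ["tomato", "schedule", "privacy"]:
    print(word, "->", rp_phonemes(word))
```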

For the initial Tacotron2 run (see 1_config.json in the zip) I let it run for longer than planned, and ended up using the best model from around 525k as the one that I then fine-tuned with BN (2_config.json).
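For anyone reproducing the two stages: from memory, the main differences between the two configs come down to a couple of flags, roughly as sketched below - but do check the actual 1_config.json / 2_config.json in the zip for the exact key names and values, as I’m paraphrasing here:

```python
# My recollection of the key config differences between the two Tacotron2 stages,
# paraphrased as Python dicts - verify against 1_config.json / 2_config.json in the zip.
stage_1_overrides = {
    "use_forward_attn": True,   # Forward Attention enabled for the initial ~525k steps
}
stage_2_overrides = {
    "prenet_type": "bn",        # switch to the batch-norm prenet for the fine-tuning stage
}
```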

Training the vocoder was fairly straightforward, although initially I got a little mixed up about the metadata.txt file coming from bin/preprocess_tts.py. I may be wrong, but I think the earlier version used CSV-format train and validation metadata files, and I initially tried with those, as I’d used them when I looked at this repo late last year. Anyway, no big deal - the file format was rejected and I soon figured it out. That will teach me to read the code more before jumping in!! :wink:

The vocoder ran for the full 400k steps. The config for that is melgan_v3_neil15.yaml in the attached zip file. I changed the value for batch_max_steps as mentioned above, but otherwise it was based on melgan_v3.yaml.

I did wonder if perhaps I should train MelGAN even longer for better results - that’s just a matter of setting the resume argument, isn’t it?

Hope that helps - fire away with any other questions etc. I’ll be back online again this evening.

Original_audio_samples_and_config_files.zip (468.4 KB)


I’m now trying a longer run with MelGAN. The resume parameter seems to be working fine, although it did look like it was stuck at first (it just took a long time with no obvious change after showing the run parameters, but then the progress bar reappeared at 400k, going up to my new 800k target).

Thx for sharing all these.

Regarding MelGAN, yes, training more always helps. Even after 2M iterations, you still see improvements.

Good to know! Thanks.

@nmstoker Hey nmstoker, I have tried to train ParallelWaveGAN on an Indonesian dataset. The sample below doesn’t sound quite clear. Have you ever encountered a similar situation? Thanks a lot. :pray:

Indonesian_wavernn_sample.zip (861.4 KB)

@nmstoker I’m trying MelGAN as well and I’m unsure about the dataset preparation, in particular splitting the dataset:
How did you split your dataset (train + dev/eval)?

In PWGAN I extract features for all wavs and then move some of them to a separate folder for dev/eval.

Is there any recommendation for a split ratio, e.g. 10:1, or are only a few feature files necessary for dev/eval (currently I have approx. 22,000 for train and 200 for eval)?

Or to put it another way: does the dev/eval set influence the learning (I haven’t fully grasped the discriminator training yet)?
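For reference, this is roughly how I’m doing the split at the moment (the paths, the .npy extension and the 200-file count are just from my setup):

```python
# Move a random subset of the extracted feature files into a separate dev/eval folder,
# leaving the rest for training. Paths, extension and eval count are just from my setup.
import random
import shutil
from pathlib import Path

feature_dir = Path("dump/train")      # all extracted features start here
eval_dir = Path("dump/eval")
eval_dir.mkdir(parents=True, exist_ok=True)

files = sorted(feature_dir.glob("*.npy"))
random.seed(0)                        # reproducible split
for f in random.sample(files, k=200): # ~200 files for dev/eval
    shutil.move(str(f), eval_dir / f.name)

print(f"{len(list(feature_dir.glob('*.npy')))} train / "
      f"{len(list(eval_dir.glob('*.npy')))} eval feature files")
```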

I think that was also my split :slight_smile: I got good results with my single-speaker dataset, but when I tried multi-speaker it was very bad. I have no idea why, but there was a lot of static. For single speaker, 400k steps were enough to get good quality.


I’ve been putting all my audio into one set, both for ParallelWaveGAN and the recently integrated vocoder (which has just finished a first long run).

Originally with ParallelWaveGAN, I’d done this to follow the training config:

meta_file_train: "metadata.txt",

and it seemed to work well there. I do see general discussions on GANs suggesting that it’s still worth having a test set, so I might try it in time.

One minor point with the integrated vocoder: it just reads the wavs in a directory (rather than looking in a metadata file to see which are to be used). In both my normal training set and test set, I have a fair number of wavs present that aren’t in my metadata files (when I had to discard a sample, I would cut the line from the metadata but not remove the wav). So I’d need a little tidy-up, and as a shortcut I just left all the audio in - the main reason for cutting wavs from the training/test sets was transcription errors or prematurely cut recordings, neither of which should particularly impact training the MelGAN (given that it’s trying to recreate the audio input; in fact it may benefit, because it then has a bit more audio to learn from).
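If I do get round to the tidy-up, it’s only a few lines to list the wavs that aren’t referenced in the metadata - roughly like this (assuming LJSpeech-style pipe-separated metadata with the clip name in the first column; the paths are just placeholders for my layout):

```python
# List wav files in the dataset folder that are not referenced in the metadata file,
# so they can be moved aside before training the vocoder.
# Assumes pipe-separated metadata with the clip id in the first column; paths are placeholders.
from pathlib import Path

wav_dir = Path("dataset/wavs")
metadata = Path("dataset/metadata.csv")

with open(metadata, encoding="utf-8") as f:
    referenced = {line.split("|")[0].strip() for line in f if line.strip()}

unreferenced = [w for w in sorted(wav_dir.glob("*.wav")) if w.stem not in referenced]
print(f"{len(unreferenced)} wavs present but not in metadata:")
for w in unreferenced:
    print(" ", w.name)
```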

FYI: I’m going to post details on my DDC runs soon (later today, I hope, as I’m off work and on a “staycation”, so for once I have time during the week!). Then I’ll follow up later with details from the integrated vocoder (which I need to do more runs with).
