Here’s some more detail, including some audio samples used for training.
My audio dataset
The samples here were just picked randomly from the full set, which overall varies reasonably well in sentence length. The recordings were made over quite a long period, so the wider set likely contains some with slightly inconsistent style, but I've generally tried to keep it neutral and steady, weeding out some of the most divergent / worst cases using the Speaker Embedding chart and a bit of brute effort.
The dataset is getting on for 16+ hours now. I'm happy to share the handful of samples included here, but I'd like to keep the overall dataset private (as it's my own voice).
Most recent training run
Having seen how good your results were with MelGAN here: https://github.com/mozilla/TTS/issues/345
I planned to follow as closely as I could the same basic approach simply using my dataset instead:
- Training a Tacotron2 model
- First with Forward Attention enabled, intended to run until ~400k (although it ended up being 525k in the end)
- Then finetuning with BN
- Then train a MelGAN model for the vocoder from here
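For anyone following along, the difference between the two Tacotron2 stages comes down to a couple of config values. This fragment is a rough illustration from memory of the Mozilla TTS config format, so treat the key names as an assumption and check against the bundled 1_config.json / 2_config.json for the real values:

```json
{
  "use_forward_attn": true,
  "prenet_type": "original"
}
```

and then for the finetuning stage, switching the prenet over:

```json
{
  "prenet_type": "bn"
}
```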
One small difference is that I'm using espeak-ng with UK RP phonemes (given that's closer to my accent than the default US English output). It's also a slightly customised version: I recompiled its dictionary to add some missing words that feature fairly commonly in my dataset, which the default espeak-ng UK RP setting was not converting accurately. I suspect the impact of this customisation on training is fairly small.
When I ran the initial Tacotron2 training (see 1_config.json in the zip) I let it run for longer than planned, and ended up taking the best model from around 525k as the one that I then finetuned with BN (2_config.json).
Training the vocoder was fairly straightforward, although initially I got a little mixed up about the metadata.txt file coming from bin/preprocess_tts.py. I may be wrong, but I think an earlier version used CSV-format train and validation metadata files, and I initially tried with those, as I'd used them when I looked at this repo late last year. Anyway, no big deal: the file format was rejected and I soon figured it out. It'll teach me to read the code more before jumping in!
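In case anyone else trips over the same thing, here's a minimal sketch of converting an LJSpeech-style pipe-separated CSV metadata file (the format I'd used previously) into a simple one-entry-per-line txt file. The exact columns and the format bin/preprocess_tts.py actually expects are assumptions here, so this illustrates the kind of mismatch rather than the script's real spec:

```python
from pathlib import Path


def csv_to_txt(csv_path: str, txt_path: str) -> int:
    """Convert pipe-separated metadata lines (assumed layout:
    wav_id|raw_text|normalised_text) into plain 'wav_id|text' lines.
    Returns the number of entries written."""
    lines_out = []
    for line in Path(csv_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("|")
        # Take the last column, assumed to be the normalised text
        wav_id, text = parts[0], parts[-1]
        lines_out.append(f"{wav_id}|{text}")
    Path(txt_path).write_text("\n".join(lines_out) + "\n", encoding="utf-8")
    return len(lines_out)
```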
The vocoder ran for the full 400k steps. The config for that is melgan_v3_neil15.yaml in the attached zip. I changed the value of batch_max_steps as mentioned above, but otherwise it was based on melgan_v3.yaml.
I did wonder whether training MelGAN for even longer might give better results - that's just a matter of setting the resume argument, isn't it?
Hope that helps - fire away with any other questions etc. I’ll be back online again this evening.
Original_audio_samples_and_config_files.zip (468.4 KB)