Here’s some more detail, including some audio samples used for training.
My audio dataset
The samples here were just picked randomly from the full set, which overall varies reasonably well in sentence length. The recordings were made over quite a long period, so the wider set likely contains some with slightly inconsistent style, but I've generally tried to keep it neutral and steady, weeding out some of the most divergent / worst cases using the Speaker Embedding chart and a bit of brute effort.
The dataset is getting on for 16+ hours now. I'm happy to share the handful of samples included here, but I'd like to keep the overall dataset private (as it's my own voice).
Most recent training run
Having seen how good your results were with MelGAN here: https://github.com/mozilla/TTS/issues/345
I planned to follow as closely as I could the same basic approach simply using my dataset instead:
- Training a Tacotron2 model
- First with Forward Attention enabled, intended to run until ~400k (although it ended up being 525k in the end)
- Then finetuning with BN
- Then train a MelGAN model for the vocoder from here
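For anyone following along, the difference between the two Tacotron2 stages comes down to a couple of config values. This fragment is a rough illustration from memory of the Mozilla TTS config format, so treat the key names as an assumption and check against the bundled 1_config.json / 2_config.json for the real values:

```json
{
  "use_forward_attn": true,
  "prenet_type": "original"
}
```

and then for the finetuning stage, switching the prenet over:

```json
{
  "prenet_type": "bn"
}
```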
One small difference is that I'm using espeak-ng with UK RP phonemes (given that's closer to my accent than the default US English output). It's also a slightly customised version: I recompiled its dictionary to add some missing words that feature fairly commonly in my dataset, which the default espeak-ng UK RP setting was not converting accurately. I suspect the impact of this customisation on training is fairly small.
When I ran the initial Tacotron2 training (see 1_config.json in the zip) I let it run for longer than planned, and ended up taking the best model from around 525k as the one that I then finetuned with BN (2_config.json).
Training the vocoder was fairly straightforward, although initially I got a little mixed up about the metadata.txt file coming from bin/preprocess_tts.py. I may be wrong, but I think an earlier version used CSV-format train and validation metadata files, and I initially tried with those, as I'd used them when I looked at this repo late last year. Anyway, no big deal: the file format was rejected and I soon figured it out. It'll teach me to read the code more before jumping in!
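In case anyone else trips over the same thing, here's a minimal sketch of converting an LJSpeech-style pipe-separated CSV metadata file (the format I'd used previously) into a simple one-entry-per-line txt file. The exact columns and the format bin/preprocess_tts.py actually expects are assumptions here, so this illustrates the kind of mismatch rather than the script's real spec:

```python
from pathlib import Path


def csv_to_txt(csv_path: str, txt_path: str) -> int:
    """Convert pipe-separated metadata lines (assumed layout:
    wav_id|raw_text|normalised_text) into plain 'wav_id|text' lines.
    Returns the number of entries written."""
    lines_out = []
    for line in Path(csv_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("|")
        # Take the last column, assumed to be the normalised text
        wav_id, text = parts[0], parts[-1]
        lines_out.append(f"{wav_id}|{text}")
    Path(txt_path).write_text("\n".join(lines_out) + "\n", encoding="utf-8")
    return len(lines_out)
```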
The vocoder ran for the full 400k steps. The config for that is melgan_v3_neil15.yaml in the attached zip. I changed the value of batch_max_steps as mentioned above, but otherwise it was based on melgan_v3.yaml.
I did wonder whether training MelGAN for even longer might give better results - that's just a matter of setting the resume argument, isn't it?
Hope that helps - fire away with any other questions etc. I’ll be back online again this evening.
Original_audio_samples_and_config_files.zip (468.4 KB)