Looking at the config files for training WaveGrad or the universal MelGAN, I see that the LibriTTS dataset is used.
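As far as I understand it, the vocoder doesn’t care about speaker labels, so swapping between a single-speaker and a multi-speaker dataset is mostly a matter of pointing the config at a different data folder. Roughly like this (the keys and paths below are just illustrative, not the exact WaveGrad/MelGAN config fields):

```python
# Illustrative only: the dataset-related part of a vocoder config, shown as a
# Python dict. Key names and paths are placeholders and may not match the
# actual WaveGrad/MelGAN config fields exactly.
dataset_config = {
    # multi-speaker: point at a LibriTTS subset
    "data_path": "/data/LibriTTS/train-clean-360/",
    # single-speaker alternative: point at one speaker's wavs instead
    # "data_path": "/data/my_single_speaker/wavs/",
    "sample_rate": 22050,  # must match the audio settings of the TTS model
}
```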
I’d previously tried just a single speaker (a private dataset) for a WaveGrad vocoder, which gave really impressive results for that speaker. Now I’m training with another single speaker from LibriVox/M-AILABS (I hope to share the model once it’s done).
What I’d be interested to hear is whether anyone has insight into how vocoder output quality for a specific voice is affected a) by using more speakers and b) by the diversity/similarity of the voices used to train the vocoder.
I’m curious to see whether training on a few similar voices would keep quality for those voices better than, say, a wider group of dissimilar voices. Given time, I may be able to figure this out by experimenting myself, but as it takes three+ days to get a decent WaveGrad on my GPU I thought I’d check what experiences or advice others had.
Having one multi-speaker vocoder for all the voices I want to generate speech for would be useful, but if it reduces quality I guess I’d stick with one vocoder per speaker.
Compared to LibriTTS, what I’ve got is more data from a small number of speakers (e.g. 20–40 hrs from a handful) rather than smaller amounts of data from a wide range of speakers (e.g. 30 min from 200+).
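For concreteness, this is roughly how I’d pool that data for one vocoder run: nothing clever, just merging each speaker’s wavs into a single training list (paths are placeholders):

```python
# Rough sketch of how I'd pool a handful of speakers for one vocoder run.
# Paths are placeholders; the vocoder doesn't need speaker labels, it just
# sees audio, so "multi-speaker" here is simply a bigger wav list.
from pathlib import Path
import random

SPEAKER_DIRS = [
    Path("/data/mailabs/speaker_a/wavs"),   # roughly 20-40 hrs per speaker
    Path("/data/mailabs/speaker_b/wavs"),
    Path("/data/private/speaker_c/wavs"),
]

wav_files = []
for d in SPEAKER_DIRS:
    wav_files.extend(sorted(d.glob("*.wav")))

random.seed(0)
random.shuffle(wav_files)

# hold out a small eval split so per-speaker quality can be compared later
n_eval = int(0.01 * len(wav_files))
eval_files, train_files = wav_files[:n_eval], wav_files[n_eval:]

print(f"{len(train_files)} training clips, {len(eval_files)} eval clips")
```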