Universal / multi-speaker vocoders

Looking at the config files for training WaveGrad or the universal MelGAN, I see that the LibriTTS dataset is used.

I’d previously tried just a single speaker (a private dataset) for a WaveGrad vocoder, which gave really impressive results for that speaker. Now I’m training with another single speaker from LibriVox/M-AILABS (I hope to share the model once it’s done).

What I’d be interested to hear is whether anyone has insight into how vocoder output quality for a specific voice is impacted a) by using more speakers and b) by the diversity/similarity of the voices used to train the vocoder.

I’m curious to see whether training on a few similar voices would result in those voices maintaining quality better than, say, a wider group of dissimilar voices. Given time, I may be able to figure this out by experimenting myself, but as it takes three-plus days to get a decent WaveGrad on my GPU I thought I’d check what experiences or advice others had.

Having one multi-speaker vocoder for the voices I want to generate speech for would be useful, but if it reduces quality I guess I’d stick with one per speaker.

Compared to LibriTTS, what I’ve got is more data from a small number of speakers (e.g. 20-40 hrs from a handful) rather than smaller amounts of data from a wide range of speakers (e.g. ~30 min from 200+).

Training on similar voices is something I have never heard of, but it kind of makes sense if you think about it. :sweat_smile: And it should not be hard to find similar voices, especially with the new speaker encoder.

Personally, I think a great part of voice universality depends on the vocoder architecture and how robust it is. HiFiGAN, for example, is able to model voices of the same gender pretty adequately when it is trained on one voice but fed a spectrogram from a completely different voice. I would be really interested to see how it fares multi-speaker-wise, especially because it can synthesize high-fidelity audio after one day of training and it is much faster than WaveGrad. Also, the quality I got with HiFiGAN was much better than what I got with WaveGrad.

Now, as far as the training set goes, my thinking is that quality differences between samples may affect output quality. So, since this is a vocoder and it’s language agnostic, I would personally look for a high-quality multi-speaker training set (I have noticed there are a lot of discrepancies in LibriTTS). I know there is a Japanese one that is much higher quality, but I do not know the name. It should be on the openslr.org list.

Thanks George! I am definitely meaning to give HiFiGAN a go in the near future, following your recommendation.

You raise a good point about quality level differences between samples.

Just to be clear, WaveGrad is definitely better when it is run for more than 50 iterations, and still comparable with 15 iterations.

Also, WaveGrad is better suited to being a universal vocoder; it generalizes better across different speakers.

You can also get comparable results with only 6 iterations by initializing the input with MelGAN or GF output instead of providing random noise.
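As a rough illustration of that warm-start idea (a minimal sketch, not the actual implementation; `wavegrad`, `gan_vocoder` and `denoise_step` are hypothetical names), it amounts to starting the reverse diffusion from a fast vocoder’s estimate rather than from noise:

```python
# Hypothetical sketch: start WaveGrad's reverse process from a MelGAN (or
# Griffin-Lim) waveform estimate instead of pure Gaussian noise, so only a
# few refinement iterations are needed. Names below are illustrative, not
# the real API.
import torch

def warm_start_vocode(wavegrad, gan_vocoder, mel, n_iters=6, noise_scale=0.1):
    with torch.no_grad():
        y = gan_vocoder(mel)                        # coarse but fast waveform estimate
        y = y + noise_scale * torch.randn_like(y)   # light perturbation to denoise from
        for step in reversed(range(n_iters)):       # a handful of reverse-diffusion steps
            y = wavegrad.denoise_step(y, mel, step)
        return y
```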

But of course, GAN models are faster, especially on CPU.

Thanks Eren. For me the WaveGrad results have been remarkably good.

On the sample quality point, I only meant that I thought I’d need to check that the samples I used for training were consistent between speakers (as George mentioned).

It’s good to hear it’s adaptable for different speakers. Do you think it adapts more easily if the speakers it’s trained on are generally similar, or is that not really an issue?

I don’t know for sure, but having similar speakers sounds like an easier problem that might also lead to easier learning. I guess it is better to just try.

How do you define “similar”, to be more precise?

Informally, I was thinking the same apparent gender, age and accent; more formally, I was thinking I’d check whether clusters of recordings sit near each other in UMAP plots of the speaker embeddings, roughly as in the sketch below.
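Here’s a minimal sketch of that check, assuming the per-utterance speaker embeddings have already been extracted with a speaker encoder and saved alongside their speaker labels (embeddings.npy and speakers.npy are placeholder file names):

```python
# Minimal sketch: project precomputed speaker embeddings with UMAP and
# colour them by speaker to see which voices cluster together.
# Assumes embeddings.npy (N x D floats) and speakers.npy (N labels) exist.
import numpy as np
import umap
import matplotlib.pyplot as plt

embeddings = np.load("embeddings.npy")                 # one embedding per utterance
speakers = np.load("speakers.npy", allow_pickle=True)  # speaker label per utterance

# 2-D projection of the high-dimensional embeddings.
proj = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(embeddings)

plt.figure(figsize=(8, 8))
for spk in np.unique(speakers):
    mask = speakers == spk
    plt.scatter(proj[mask, 0], proj[mask, 1], s=5, label=spk)
plt.legend(markerscale=3, fontsize="small")
plt.title("Speaker embeddings (UMAP)")
plt.show()
```

If two speakers’ utterances overlap heavily in that plot, I’d treat them as “similar” for this experiment.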

I’ll give it a go training first on a pair of similar voices and then on a pair of dissimilar voices to see how the results compare. It’ll probably be quite a few days before I can get to it, but I’ll share what I find here.

Now, after giving it a second thought, I think it is easier for the model to learn voices if they are more distinct. Otherwise, I assume the model produces something interpolated between the similar voices. But I am just guessing here.