Two questions about Multi Speaker models

I realise the best answer is for me to simply try it, but thought I’d ask those who’d tried it for some general impressions first.

Q. With multi speaker models in TTS, is there any discernible influence from one voice to another? For instance do they start to pick up characteristics from each other or do the voices remain distinct and like the original?

Q. And do the voices yield better quality output than one would get by training just for one of the voices? By this, I mean is the act of training them all together helping the model learn overall common characteristics of language that it applies to all voices?

I’m wondering whether a multispeaker model with several relatively similar voices would help them reach better quality because they might somehow reinforce each other, or whether it’s actually better to go for quite distinct voices (e.g. different accents or speaking patterns). I’m guessing distinct voices would generally be less likely to reinforce each other (if that happens at all?!) but may have some other advantage?

In my experience, training a multi-speaker model is more challenging since each speaker has a different distribution and different prosody. That makes it harder for the attention to align and for the decoder to learn the right voice.

However, it’s a different story for the vocoder model. It works better with multi-speaker training but it takes longer to converge.

So for the TTS model, the speakers don’t help each other learn; they compete. Therefore, for the best quality, the dataset should be balanced in terms of the number of records from each speaker.
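
One simple way to get an equal number of records per speaker is to downsample every speaker to the size of the smallest one. This is only a minimal sketch: the `(speaker_id, utterance)` record layout, the file names, and the `balance_by_speaker` helper are hypothetical and not part of any TTS codebase, and downsampling is just one option (oversampling or weighted sampling would be alternatives).

```python
import random
from collections import Counter, defaultdict


def balance_by_speaker(records, seed=0):
    """Downsample so every speaker keeps as many records as the smallest speaker.

    `records` is assumed to be a list of (speaker_id, utterance) pairs;
    this layout is hypothetical and chosen only for illustration.
    """
    by_speaker = defaultdict(list)
    for speaker_id, utterance in records:
        by_speaker[speaker_id].append((speaker_id, utterance))
    if not by_speaker:
        return []

    # The smallest speaker sets the per-speaker record count.
    target = min(len(items) for items in by_speaker.values())

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for speaker_id in sorted(by_speaker):
        balanced.extend(rng.sample(by_speaker[speaker_id], target))
    return balanced


# Toy data: speaker "spk_a" has 5 records, "spk_b" has only 2.
records = [("spk_a", f"a_{i}.wav") for i in range(5)]
records += [("spk_b", f"b_{i}.wav") for i in range(2)]

balanced = balance_by_speaker(records)
counts = Counter(speaker_id for speaker_id, _ in balanced)
print(counts)
```

The trade-off is that downsampling throws data away, so with very uneven speakers it may be worth comparing against a weighted sampler instead.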

I have been working with multispeaker for a while now, but for the last two weeks it’s been in the backlog because of other things. My objective is more to figure out whether I can actually get a voice with influences from a sum of n voices in the dataset, by feeding the TTS the mean of these speakers’ embeddings.

I started training a multispeaker TTS on LibriTTS with Graves attention and I think I got it to 150k steps, where it was able to produce good speech, although it by no means converged for all voices. It was able to produce speech in different voices if I gave it different IDs and, indeed, a random vector yielded a voice with mixed characteristics, but I think I stopped the training too soon. What I noticed is that the first voice (ID 0) was the most prominent one at that point, although another voice was also there.

Eren mentioned that, in general, a multispeaker TTS should be able to generalize okay, but I gather it depends greatly on the dataset and the voice representation stats, in order to get a good vector space representation. It also depends on your use case: if you want a single-speaker TTS of good quality, I would guess it might be better to finetune a pretrained model (the best one for me so far has been the one trained with ForwardAttn and BN).

As for external embeddings, I have tried to do some work here, but it failed some Travis tests, so Eren was not able to merge it, and I have not yet found time to look into what is wrong.
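
To make the "mean of the speakers’ embeddings" idea concrete, here is a rough sketch of the averaging step only. The `speaker_embeddings` table, its values, and the `mean_embedding` helper are made up for illustration; how the resulting vector is actually passed to the model depends on the TTS code and is not shown.

```python
def mean_embedding(embeddings):
    """Return the element-wise mean of equal-length embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(vec) != dim for vec in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    return [sum(vec[i] for vec in embeddings) / count for i in range(dim)]


# Hypothetical 3-dimensional embeddings for two speaker IDs.
speaker_embeddings = {
    0: [1.0, 0.0, 2.0],
    1: [3.0, 2.0, 0.0],
}

# Average the chosen speakers to get a "mixed" speaker vector.
mixed = mean_embedding([speaker_embeddings[0], speaker_embeddings[1]])
print(mixed)  # [2.0, 1.0, 1.0]
```

Whether such an averaged vector lands in a region of the embedding space that the decoder handles well is exactly the open question; the arithmetic itself is the easy part.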

Thank you both for sharing those insights, it’s very much appreciated.

Looks like I’ll have to start weighing up options regarding datasets and try a few. I’m not quite there yet, as I’m still (slowly!) working on methods to better handle pronunciation in my main dataset (to spot cases where the phonemes aren’t accurate or I’ve got heteronyms).

Will let you know how I get on when I do try this out. Thanks again!