Help for training a multi speaker model for voice cloning

julian.weber · September 24, 2020, 1:42pm

Hi,

I’m training a model in French to do voice cloning, and I’m using CorentinJ’s encoder to compute the speaker embeddings. I started by training on the 5 voices of the French M-AILABS dataset and got good results. It’s not sufficient to reproduce an accent or anything like that, but the generated voice is intelligible and the pitch matches that of the cloned speaker.

To increase the fidelity of the cloning, I tried adding the “mix” folder of the M-AILABS dataset (each chapter of the book is read by a different speaker). The quality of the voice then started to go downhill quickly, but I didn’t notice since everything seemed fine on the TensorBoard (grey and orange curves).

Then, because the TensorBoard looked fine, I added the French part of the Common Voice dataset, where each speaker has only around 4 voice samples (12k different speakers). The TensorBoard showed that the model struggled to bring the loss as low as before, but that was understandable since the data is much noisier and every speaker has a different mic.

I was a bit horrified by the synthesized results: after 260k steps, the model can’t form a single word correctly in a sentence. Neither can the checkpoint from before Common Voice was added. So I have a few questions:

Is it possible with a multi-speaker model to have test sentences with a specific speaker? Or test sentences at all, for that matter: even though I specified test_sentences_file in config.json, I couldn’t see any of my test sentences in the TensorBoard (it worked before on a single-speaker model).
Is it detrimental to have all samples under one “mix” speaker, given that every sample has its own speaker embedding computed? (I kept my original speakers, but all other unidentified speakers are under the mix speaker.)
When using different datasets like I did, is there any recommended preprocessing that can help? Audio normalization?
Why can’t we use mean-var normalization for multi-speaker models?
(In the config.json we can read:)

“stats_path”: null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by ‘compute_statistics.py’. If it is defined, mean-std based normalization is used and other normalization params are ignored

Thanks a lot for reading this post

Note: For big datasets I find the eval interval impractical: I trained for 260k steps and the last eval was at around 190k. Wouldn’t it be useful to set an eval interval based on training steps rather than epochs?
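To make the suggestion concrete, here is a minimal sketch of a step-based evaluation trigger. The function name `should_evaluate` and the 10k-step interval are made up for illustration; they are not options of the TTS repo.

```python
def should_evaluate(global_step: int, eval_every_steps: int) -> bool:
    """Return True when an evaluation pass is due at this training step."""
    if eval_every_steps <= 0:
        return False  # step-based evaluation is disabled
    return global_step > 0 and global_step % eval_every_steps == 0


# Steps at which evaluation would fire during a 260k-step run,
# assuming (hypothetically) an interval of 10k steps.
eval_steps = [s for s in range(1, 260_001) if should_evaluate(s, 10_000)]
```

With a check like this inside the training loop, the gap between two evaluations stays constant regardless of how many steps one epoch takes.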

georroussos · September 24, 2020, 6:31pm

I don’t think you can have test sentences with specific speakers, but I might be wrong; I don’t think I have seen a switch for it. You might be able to write some lines and use a delimiter in the test.txt file, though. I don’t think what you describe with the mix speaker is a problem, since every utterance has its own embedding and I doubt every training batch separates speakers. For normalization, I personally have used the default configuration. Re: the mean-var, I think Eren mentioned it is a bit of a hassle to compute mean-var stats for multi-speaker models, because they have to be computed separately for each speaker.
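As a rough illustration of the delimiter idea, the sketch below parses test-sentence lines of the form `speaker|sentence`. The `|` delimiter, the function name `parse_test_sentences`, and the speaker name are assumptions for the example; the repo does not necessarily read test files this way.

```python
def parse_test_sentences(lines, delimiter="|", default_speaker=None):
    """Split each non-empty line into a (speaker, sentence) pair.

    Lines without the delimiter fall back to default_speaker.
    """
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if delimiter in line:
            speaker, sentence = line.split(delimiter, 1)
            pairs.append((speaker.strip(), sentence.strip()))
        else:
            pairs.append((default_speaker, line))
    return pairs


# Hypothetical contents of a test-sentences file.
example_lines = [
    "speaker_a|This is a test sentence.",
    "",
    "A sentence without a speaker.",
]
pairs = parse_test_sentences(example_lines)
```

Each pair could then be used to look up the speaker's embedding before synthesizing the sentence at test time.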

The multi-speaker TTS works OK (I have run some tests), but my opinion is that it absolutely benefits from restoring pretrained weights. So maybe you’d first like to train a French single-speaker model and then restore that. However, do bear in mind that in an ideal scenario the speech is very clean. What I advise is to start with English instead, because then you can restore the LJSpeech TTS and use either LibriTTS or VCTK. @edresson1 has gotten very impressive results on VCTK and he has even been able to form artificial speakers.

julian.weber · September 29, 2020, 12:12pm

Thanks for your answer. I wasn’t able to get good cloning results from the Colab notebook. As erogol said in this comment, the cloned voice is not very close to the original speaker, but I think this will be fixed when we have a better speaker encoder.

I feel like it was the attention mechanism that failed me during training. Do you know of an easy way to freeze the weights of certain parts of the network during training?

georroussos · September 29, 2020, 12:29pm

Yes! @edresson1 once told me he has it implemented on his fork so that you can toggle freezing through the config.json. If you want to use the original repo, you can go into train_tts.py and, e.g., to freeze different parts:

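# Freeze the postnet: its weights stop receiving gradient updates.
for param in model.postnet.parameters():
    param.requires_grad = False

# Freeze the decoder; `name` can be used to filter specific sub-modules.
for name, param in model.decoder.named_parameters():
    param.requires_grad = False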
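The same idea can be driven by a list of name prefixes, which is one way a config toggle could work. This is only a sketch: `freeze_by_prefix` and the parameter names are hypothetical, and stand-in objects replace real parameters so the example runs without PyTorch. With a real model you would pass `model.named_parameters()` instead.

```python
from types import SimpleNamespace


def freeze_by_prefix(named_parameters, prefixes):
    """Set requires_grad=False on parameters whose name starts with any prefix.

    Returns the names of the parameters that were frozen.
    """
    prefixes = tuple(prefixes)
    frozen = []
    for name, param in named_parameters:
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Stand-ins for the (name, parameter) pairs a model would yield.
# These parameter names are invented for illustration.
params = [
    ("decoder.attention.query_layer.weight", SimpleNamespace(requires_grad=True)),
    ("decoder.stopnet.linear.weight", SimpleNamespace(requires_grad=True)),
    ("postnet.conv.weight", SimpleNamespace(requires_grad=True)),
]
frozen_names = freeze_by_prefix(params, ["decoder.attention", "decoder.stopnet"])
```

Here only the attention and stopnet entries are frozen, while the postnet entry stays trainable.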

julian.weber · September 29, 2020, 12:34pm

Great, thank you so much. I’ll keep you posted here if I get good results.

georroussos · September 29, 2020, 12:36pm

No problem! But I’d give restoring pretrained weights a chance; it always helps me, no matter what.

julian.weber · September 29, 2020, 12:43pm

You mean restoring only the parts of the model that you’re interested in? I was going to restore my model from the 70k checkpoint, wait for the loss to stabilize, then freeze the weights (I’m thinking attention and stopnet) and add the other datasets.