I’m training a French voice-cloning model and I’m using CorentinJ’s encoder to compute the speaker embeddings. I started by training on the 5 voices of the French M-AILABS dataset and got good results. It’s not sufficient to reproduce an accent or anything, but the generated voice is intelligible and the pitch matches the cloned speaker.
To increase the fidelity of the cloning, I tried adding the “mix” folder of the M-AILABS dataset (each chapter of the book is read by a different speaker). The quality of the voice then started to go downhill quickly, but I didn’t notice since everything seemed fine on the TensorBoard (grey and orange curves).
Because the TensorBoard looked fine, I then added the French part of the Common Voice dataset, where each speaker only has around 4 voice samples (12k different speakers). The TensorBoard showed that the model struggled to get the loss as low as before, but that was understandable since the data is much noisier and every speaker has a different mic.
I was a bit horrified by the synthesized results after 260k steps: the model can’t form a single word correctly in a sentence. Even the checkpoint from before Common Voice can’t. So I have a few questions:
Is it possible with a multi-speaker model to have test sentences with a specific speaker? Or test sentences at all, for that matter? Even though I specified test_sentences_file in config.json, I couldn’t see any of my test sentences in the TensorBoard (it worked before on a single-speaker model).
Is it detrimental to have all samples under one “mix” speaker, given that every sample has its own speaker embedding computed? (I have my original speakers, but all the other unidentified speakers are under the “mix” speaker.)
When combining different datasets like I did, is there any recommended preprocessing that can help? Audio normalization?
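To make the question concrete, the kind of audio normalization I have in mind could look like the sketch below. It is only an illustration: the function name rms_normalize and the -27 dBFS target are my own assumptions, not something taken from the TTS repo or the datasets.

```python
import numpy as np

def rms_normalize(wav, target_dbfs=-27.0, eps=1e-9):
    """Scale a float waveform (values in [-1, 1]) to a target RMS level.

    Hypothetical helper: the name and the default target are assumptions.
    """
    wav = np.asarray(wav, dtype=np.float64)
    rms = np.sqrt(np.mean(wav ** 2))
    if rms < eps:
        # Silent clip: nothing sensible to scale.
        return wav.copy()
    target_rms = 10.0 ** (target_dbfs / 20.0)
    out = wav * (target_rms / rms)
    # Guard against clipping if the gain pushes peaks above full scale.
    peak = np.max(np.abs(out))
    if peak > 1.0:
        out = out / peak
    return out
```

Applied to every clip before feature extraction, this would bring recordings from different mics to a comparable loudness. It does nothing about noise or reverberation.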
Why can’t we use mean_var norm for multi speaker models ?
(In config.json we can read:)
“stats_path”: null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by ‘compute_statistics.py’. If it is defined, mean-std based normalization is used and other normalization params are ignored
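For reference, this is my understanding of what mean-std normalization does in general, as a minimal sketch. It is not the code of compute_statistics.py; the function names and the (n_mels, n_frames) array layout are assumptions for illustration.

```python
import numpy as np

def compute_stats(mels):
    """Per-mel-bin mean and std over all frames of all utterances.

    mels: list of arrays shaped (n_mels, n_frames). Layout is an assumption.
    """
    frames = np.concatenate(mels, axis=1)
    return frames.mean(axis=1), frames.std(axis=1)

def normalize(mel, mean, std, eps=1e-8):
    """Shift and scale each mel bin with the dataset-wide statistics."""
    return (mel - mean[:, None]) / (std[:, None] + eps)

def denormalize(mel_norm, mean, std, eps=1e-8):
    """Invert normalize() to get back the original scale."""
    return mel_norm * (std[:, None] + eps) + mean[:, None]
```

In this sketch the statistics are computed once over the whole dataset, so every speaker is normalized with the same mean and std. I don’t know whether that is the reason behind the warning.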
Thanks a lot for reading this post
Note: For big datasets I find the eval interval impractical. I trained for 260k steps and the last eval was at around 190k. Wouldn’t it be interesting to set an eval interval based on training steps rather than epochs?
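What I have in mind is something like the sketch below. The names should_evaluate and eval_every_steps are hypothetical and do not correspond to an existing config.json option, as far as I know.

```python
def should_evaluate(global_step, eval_every_steps):
    """Return True when an evaluation pass is due at this training step.

    eval_every_steps is a hypothetical setting; 0 or less disables it.
    """
    return (
        eval_every_steps > 0
        and global_step > 0
        and global_step % eval_every_steps == 0
    )

def eval_steps(total_steps, eval_every_steps):
    """List every step at which evaluation would run during training."""
    return [
        step
        for step in range(1, total_steps + 1)
        if should_evaluate(step, eval_every_steps)
    ]
```

With eval_every_steps set to 10000, a 260k-step run would be evaluated every 10k steps, including at step 260000, regardless of how long an epoch is.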