Cloning my own voice does not work at all

Hello, I am trying to clone my voice using TTS.
I am using this Colab notebook and following all the steps.
https://colab.research.google.com/drive/1OX2jtyxmeFt7kRO65QgPy65ku6HWUrFK?usp=sharing#scrollTo=yvb0pX3WY6MN

I even uploaded a new WAV file with nearly one minute of audio.
The results are not satisfactory; there is no change from the default voice that is generated.

Best Regards

Hello @guitarplayersachin.
Welcome to our community :slight_smile:.

I don’t know the logic behind this work by @edresson1 (hope I pinged the right one), but even if it’s based on an existing model, a single WAV file with one minute of audio doesn't seem like enough. Just for comparison: training a new model on your own voice takes several hours of audio recordings.

I tried it a few days back and it did alter the character of the voice in the final steps where you submit your own recording.

There could be lots of factors, such as the quality / style of the sample @guitarplayersachin used, or possibly a mistake in running it (although as it’s a Colab that should be less likely).

The other aspect is whether you’ve got reasonable expectations: it’s not going to change the voice to a level where others will think it’s you; it introduces some of the character of your voice into the synthesised voice.

I believe there is a new addition to the repo in this pull request.

This refers to the Colab link I shared in the previous post; the notebook specifies that you need at least 3 seconds of audio to clone your own voice, as shown in the screenshot below.

I am really happy to be posting here, and thanks for your response.

Best Regards

Thank you for your response.
I assume we are to run all the cells, one by one?

Yes, it asks for input at a couple of points, so you might as well step through them one by one.

Further to @mrthorstenm's point, he’s right that if you want to train a really accurate model you’d need a lot more data. Using more data with the main TTS repo can produce (after a fair bit of effort) a decent voice. What @edresson1's repo does is a little different (and very cool!): it explores varying the voice output using GST (global style tokens) and speaker embeddings, so you can change characteristics within the output voice. They’re slightly different goals but part of the same overall effort.
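To give a rough idea of the GST mechanism mentioned above, here is a minimal NumPy sketch: a bank of style tokens is combined into a single style embedding by softmax attention weights. All names and dimensions here are made up for illustration; this is not the repo's actual code or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the actual repo.
NUM_TOKENS, TOKEN_DIM = 10, 256

# A bank of learned "global style tokens" (here: random placeholders).
style_tokens = rng.standard_normal((NUM_TOKENS, TOKEN_DIM))

def style_embedding(attention_logits):
    """Combine the style tokens into one embedding via softmax attention."""
    weights = np.exp(attention_logits - attention_logits.max())
    weights /= weights.sum()
    return weights @ style_tokens  # weighted sum, shape (TOKEN_DIM,)

# In training, the logits come from a reference encoder over a mel
# spectrogram; here we just sample them to pick an arbitrary style.
emb = style_embedding(rng.standard_normal(NUM_TOKENS))
print(emb.shape)  # (256,)
```

The key point is that one continuous vector (the weighted sum) summarises "style", so nudging the attention weights nudges the character of the synthesised voice.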

Also welcome @guitarplayersachin - please excuse my not saying that before :slightly_smiling_face:


Hello, this is still in development; we hope to achieve better results soon. However, you should not expect results like those from training a model on a few hours of your voice. The initial idea is to generate data for training ASR systems, so we just want to introduce changes to the voices. This pre-trained model can be used to adapt to your voice if you have a lot of speaking time: basically, take this model and fine-tune it for some steps on your voice data. The model is already aware of multiple speakers, so it should make training on a specific voice easier.

However, we hope to improve the model to sound closer to the speaker’s voice. At the moment this model has only been trained on the VCTK dataset, which has only 109 speakers and a limited vocabulary. We are testing several possibilities, and after these tests we intend to train the model on LibriTTS, so we should get better results.
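As a rough illustration of the fine-tuning idea above (take the pre-trained model and train it for some steps on your own voice data), here is a toy NumPy sketch. The "model" is just a weight matrix and the data is random; a real run would fine-tune the actual TTS checkpoint, not this stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained model: a single weight matrix mapping
# 80-bin input frames to 80-bin output frames (mel-spectrogram sized).
W = rng.standard_normal((80, 80)) * 0.1

# Dummy fine-tuning data standing in for frames of your voice.
X = rng.standard_normal((64, 80))
Y = rng.standard_normal((64, 80))

def loss(W):
    return np.mean((X @ W - Y) ** 2)

lr, start = 1e-3, loss(W)
for step in range(200):                   # "some steps" of fine-tuning
    grad = 2 * X.T @ (X @ W - Y) / len(X)  # (scaled) gradient of the loss
    W -= lr * grad                         # plain gradient-descent update

print(loss(W) < start)  # True: the adapted weights fit the new data better
```

The same principle applies to the real model: starting from multi-speaker weights means far fewer steps (and far less data) are needed than training from scratch.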

If I understand correctly, with the current Colab demo, if I have a large recording of my voice (say around 10 hours), I should be able to get the output to reflect my voice?

Secondly, I would love to understand the ins and outs of this implementation. Could you please point me to some resources?
Best Regards

Basically, the more data, the closer the synthesized output should be to your voice. However, with a large amount of data it is more advantageous to fine-tune the model on your voice. The objective of the model is that, with just a few seconds of a voice, it is possible to synthesize a similar voice. However, the model is still under development, so we have not yet achieved the desired results. That said, for speakers seen in training, the voices tend to sound better than with other models. In addition, we can generate random embeddings to create new voices (artificial voices); this can be useful for training ASR models (which is one of my personal reasons for working on it!).
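As a sketch of the random-embedding idea, assuming the speaker encoder produces L2-normalised vectors (an assumption for illustration, not something confirmed by the repo): sampling and normalising a Gaussian vector gives an "artificial speaker", and normalised blends interpolate between voices.

```python
import numpy as np

rng = np.random.default_rng(42)

EMB_DIM = 256  # hypothetical speaker-embedding size, not from the repo

def random_speaker_embedding():
    """Sample a random point on the unit hypersphere.

    Speaker encoders typically L2-normalise their embeddings, so a
    normalised Gaussian sample is a plausible "artificial speaker".
    """
    v = rng.standard_normal(EMB_DIM)
    return v / np.linalg.norm(v)

# Two artificial voices, plus a blend in between them; each vector
# would be fed to the synthesizer in place of a real speaker's embedding.
a, b = random_speaker_embedding(), random_speaker_embedding()
blend = (a + b) / np.linalg.norm(a + b)
```

Feeding such vectors to the synthesizer yields voices that belong to no real speaker, which is exactly what you want for generating diverse ASR training data.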

We are planning to train a model with LibriTTS; this should bring better results, as we will have more speakers. The current model was trained on VCTK, which has only 109 speakers and a much more limited vocabulary.
