Fine-Tune speaker-encoder on own data. Is it worth it?

Hello everyone,

As the title suggests, would it make sense to fine-tune the current speaker-encoder to your own data with the aim of obtaining better embeddings for later multi-speaker training? Or will it rather distort the quality of the encoder and harm the embeddings and further training?


I’ve not tried fine-tuning it. If you do, let me know. I wonder how it performs as well.

I’ve fine-tuned the speaker-encoder for ~10k steps with some modifications to the audio params.

Plotting the embeddings with UMAP, I get the following results.

Left is the original, while right is the fine-tuned one.

On the left, the female and male speakers are separated nicely, but the embeddings of individual speakers kinda merge into each other. On the right, the speakers are for the most part split nicely, but the gender split is no longer visible.

I guess with the fine-tuned embeddings I should get better results regarding the characteristics of the voice.

With the original embeddings, some speakers kinda lost their characteristics and sounded identical to each other.
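For anyone who wants a number to go with the plots, here is a rough sketch of one way to compare the two encoders: the mean cosine similarity between utterances of the same speaker versus utterances of different speakers. This is not the script used above. The function names, the dict layout (speaker name mapped to a list of embedding vectors) and the toy vectors are all assumptions for illustration. A larger gap between the two means would suggest speakers are less likely to collapse into each other.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def separation(embeddings):
    """Return (mean same-speaker similarity, mean cross-speaker similarity).

    `embeddings` is a hypothetical layout: {speaker_name: [vector, ...]}.
    Needs at least two speakers and at least one speaker with two utterances.
    """
    items = [(spk, vec) for spk, vecs in embeddings.items() for vec in vecs]
    intra, inter = [], []
    for (spk_a, vec_a), (spk_b, vec_b) in combinations(items, 2):
        target = intra if spk_a == spk_b else inter
        target.append(cosine(vec_a, vec_b))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy 2-D vectors, made up for the example (real embeddings are much larger).
toy = {
    "speaker_a": [[1.0, 0.0], [0.9, 0.1]],
    "speaker_b": [[0.0, 1.0], [0.1, 0.9]],
}
intra_mean, inter_mean = separation(toy)
print(f"same speaker: {intra_mean:.3f}, different speakers: {inter_mean:.3f}")
```

Running this once on embeddings from the original encoder and once on the fine-tuned one would give a simple side-by-side check in addition to the visual one.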


  • Speakers: 159
  • Language: German
  • Audio was extracted from games (Skyrim, Witcher 3, Gothic 1-3)
  • Highly expressive prosody in speech; some speakers also voice multiple characters.

As a footnote, with the current implementation I was somewhat able to train a multi-speaker TTS with r=1.

Thanks for sharing. So we can say fine-tuning works well.

What language is it?

Do you see any other difference than having different speakers in comparison to LibriTTS?

Updated my post.

Not really.

Out of curiosity, how do you extract voice from games?

Using tools provided by the respective community.

The above games have good transcription-to-speech alignment.

But there is still a lot of pre-processing required after the files are extracted, which is what I spent the second-most time on.
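To give an idea of what such pre-processing can look like, here is a minimal sketch of one typical step: dropping clips that are too short or too long. It is not necessarily what was done for this dataset. It assumes the extracted files are already plain PCM WAV, and the function names and the 1 to 10 second limits are made up for illustration.

```python
import wave
from pathlib import Path


def clip_duration_seconds(path):
    """Duration of a PCM WAV file, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def filter_clips(folder, min_s=1.0, max_s=10.0):
    """Return the WAV files in `folder` whose duration is within [min_s, max_s].

    The limits are assumed defaults, not values from this thread.
    """
    kept = []
    for path in sorted(Path(folder).glob("*.wav")):
        if min_s <= clip_duration_seconds(path) <= max_s:
            kept.append(path)
    return kept
```

Game audio often ships in other containers or codecs, so a conversion step would normally come before this.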

Respective community?

Gothic, for example, still has a big following and modding community. Tools to extract files can be found here

Skyrim has something like this


Your sound samples show that professional speakers like those from Gothic can make a huge difference. Though children might be scared, the overall quality is impressive, at least to my ears. I guess it's for private use or demo purposes only, since all of the download links to some models from the past are no longer valid, both in this forum and on worldofgothic? Otherwise, I would appreciate a PM :)

At the moment I am mainly experimenting, so there is no functional model yet. I also have to deal with copyright and figure out under which conditions I can release the model for public use, since it is based on data from licensed games.

Got it.

Experiments need time to be successful. A lousy attempt: Minental sample