I’ve fine-tuned the speaker-encoder for ~10k steps with some modifications to the audio params.
Plotting the UMAP projections of the embeddings, I get the following results.
Left is the original, while right is the fine-tuned one.
On the left, the female and male speakers are separated nicely, but the individual speaker embeddings somewhat merge into each other. On the right, the speakers are for the most part split nicely, but the gender split is no longer visible.
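For context, here is a minimal sketch of how such a UMAP plot can be produced from dumped embeddings (file names, array shapes, and UMAP parameters are placeholders, not my exact setup):

```python
# Minimal sketch: 2D UMAP projection of speaker embeddings, colored by speaker.
import numpy as np
import umap
import matplotlib.pyplot as plt

embeddings = np.load("speaker_embeddings.npy")   # shape: (N, embed_dim), hypothetical file
speaker_ids = np.load("speaker_ids.npy")         # shape: (N,), one integer label per utterance

# Project the high-dimensional embeddings down to 2D for visual inspection.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine", random_state=42)
projection = reducer.fit_transform(embeddings)

# Color each point by its speaker so per-speaker clusters become visible.
plt.scatter(projection[:, 0], projection[:, 1], c=speaker_ids, cmap="tab20", s=5)
plt.title("UMAP projection of speaker embeddings")
plt.show()
```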
My assumption is that the fine-tuned embeddings should give better results with respect to the individual voice characteristics.
With the original embeddings, some speakers kind of lost their characteristics and sounded identical to each other.
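One way to quantify that merging, rather than just eyeballing the UMAP, is to look at the pairwise cosine similarity between per-speaker centroid embeddings. A rough sketch (again with placeholder names; the 0.85 threshold is arbitrary):

```python
# Rough sketch: pairwise cosine similarity between per-speaker mean embeddings.
# High off-diagonal values point to speakers whose embeddings have merged.
import numpy as np

def speaker_similarity(embeddings: np.ndarray, speaker_ids: np.ndarray) -> np.ndarray:
    """Return a (num_speakers, num_speakers) cosine-similarity matrix of speaker centroids."""
    speakers = np.unique(speaker_ids)
    centroids = np.stack([embeddings[speaker_ids == s].mean(axis=0) for s in speakers])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)  # L2-normalize each centroid
    return centroids @ centroids.T

# Usage: flag speaker pairs that are suspiciously close to each other.
# sim = speaker_similarity(embeddings, speaker_ids)
# close_pairs = np.argwhere(np.triu(sim, k=1) > 0.85)
```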
Dataset:
- Speakers: 159
- Language: German
- Audio was extracted from games (Skyrim, Witcher 3, Gothic 1-3)
- Highly expressive prosody; some speakers also voice multiple characters.
As a footnote: with the current implementation I was, to some extent, able to train a multi-speaker TTS with r=1.