As the title suggests: would it make sense to fine-tune the current speaker encoder on your own data, with the aim of obtaining better embeddings for later multi-speaker training? Or would that rather degrade the encoder and harm the embeddings and further training?
On the left, the female and male speakers are separated nicely, but the embeddings of individual speakers kind of merge into each other. On the right, the speakers are for the most part split nicely, but the gender split is no longer visible.
I guess with the fine-tuned embeddings I should get better results regarding the characteristics of the voice.
With the original embeddings, some speakers lost their characteristics and sounded identical to each other.
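In case it helps to put a number on "merge into each other" instead of judging it from the plot alone, here is a minimal sketch. It assumes the embeddings are available as a plain dict mapping a speaker name to a list of utterance vectors; that layout and the function names (`cosine`, `centroid`, `separation_report`) are made up for illustration and are not part of the TTS repo. It compares how close utterances sit to their own speaker centroid with how close the speaker centroids are to each other.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Guard against all-zero vectors instead of dividing by zero.
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def separation_report(embeddings_by_speaker):
    """Compare intra-speaker and inter-speaker similarity.

    embeddings_by_speaker: dict of speaker name -> list of embedding vectors
    (hypothetical layout, adapt it to however your embeddings are stored).
    Needs at least two speakers.
    """
    centroids = {
        speaker: centroid(vectors)
        for speaker, vectors in embeddings_by_speaker.items()
    }
    # How close each utterance is to its own speaker's centroid.
    intra = [
        cosine(vector, centroids[speaker])
        for speaker, vectors in embeddings_by_speaker.items()
        for vector in vectors
    ]
    # How close the centroids of different speakers are to each other.
    inter = [
        cosine(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    ]
    return {
        "intra": sum(intra) / len(intra),
        "inter": sum(inter) / len(inter),
    }
```

Running `separation_report` once on the original embeddings and once on the fine-tuned ones gives two pairs of numbers to compare. A high "intra" together with a low "inter" would suggest the speakers are better separated, though this is only a rough indicator and says nothing about the gender split.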
Dataset:
Speakers: 159
Language: German
Audio was extracted from games (Skyrim, Witcher 3, Gothic 1-3)
Highly expressive prosody; some speakers also voice multiple characters.
As a footnote: with the current implementation I was more or less able to train a multi-speaker TTS with r=1.
Your sound samples show that professional speakers like those from Gothic can make a huge difference. Children might be scared, but the overall quality is impressive, at least to my ears. I guess it's for private use or demo purposes only, since all the download links to earlier models are no longer valid, both in this forum and on worldofgothic? Otherwise, I would appreciate a PM.
At the moment I am mainly experimenting, so there is no functional model yet. I also have to sort out the copyright question, i.e. under which conditions I can release the model for public use, since it is based on data from licensed games.