I’ve fine-tuned the speaker-encoder for ~10k steps with some modifications to the audio params.
Plotting the UMAP projections of the embeddings, I get the following results.
Left is the original, while right is the fine-tuned one.
On the left, the female and male speakers are separated nicely, but the individual speaker embeddings somewhat merge into each other. On the right, the speakers are for the most part split nicely, but the gender split is no longer visible.
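For context, here is a minimal sketch of how such a UMAP plot can be produced from dumped embeddings (file names, array shapes, and UMAP parameters are placeholders, not my exact setup):

```python
# Minimal sketch: 2D UMAP projection of speaker embeddings, colored by speaker.
import numpy as np
import umap
import matplotlib.pyplot as plt

embeddings = np.load("speaker_embeddings.npy")   # shape: (N, embed_dim), hypothetical file
speaker_ids = np.load("speaker_ids.npy")         # shape: (N,), one integer label per utterance

# Project the high-dimensional embeddings down to 2D for visual inspection.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine", random_state=42)
projection = reducer.fit_transform(embeddings)

# Color each point by its speaker so per-speaker clusters become visible.
plt.scatter(projection[:, 0], projection[:, 1], c=speaker_ids, cmap="tab20", s=5)
plt.title("UMAP projection of speaker embeddings")
plt.show()
```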
My assumption is that the fine-tuned embeddings should give better results with respect to the individual voice characteristics.
With the original embeddings, some speakers kind of lost their characteristics and sounded identical to each other.
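One way to quantify that merging, rather than just eyeballing the UMAP, is to look at the pairwise cosine similarity between per-speaker centroid embeddings. A rough sketch (again with placeholder names; the 0.85 threshold is arbitrary):

```python
# Rough sketch: pairwise cosine similarity between per-speaker mean embeddings.
# High off-diagonal values point to speakers whose embeddings have merged.
import numpy as np

def speaker_similarity(embeddings: np.ndarray, speaker_ids: np.ndarray) -> np.ndarray:
    """Return a (num_speakers, num_speakers) cosine-similarity matrix of speaker centroids."""
    speakers = np.unique(speaker_ids)
    centroids = np.stack([embeddings[speaker_ids == s].mean(axis=0) for s in speakers])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)  # L2-normalize each centroid
    return centroids @ centroids.T

# Usage: flag speaker pairs that are suspiciously close to each other.
# sim = speaker_similarity(embeddings, speaker_ids)
# close_pairs = np.argwhere(np.triu(sim, k=1) > 0.85)
```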
Dataset:
- Speakers: 159
- Language: German
- Audio was extracted from games (Skyrim, Witcher 3, Gothic 1-3)
- Highly expressive prosody; some speakers also voice multiple characters.
As a footnote: with the current implementation I was, to some extent, able to train a multi-speaker TTS with r=1.