Hi @smg, I hope I can help, but it would be good to confirm your objective with the speaker embeddings first, as there are a few different ways they can be used.
You mention that you’re trying to use the various Italian speakers from Common Voice. Are you trying to a) produce a multi-speaker model as per issue #166, or b) train a single TTS model but simply use all the audio from a range of different speakers?
If a): I haven’t looked at multi-speaker models myself, so I’d have to defer to the expertise of others, but the impression I got from reading #166 is that it isn’t able to produce such clear voices (yet).
If b): That’s not something I’ve tried, but with my recordings of a single speaker (me!) I’ve seen that including audio where my speaking style wasn’t really consistent had a big impact on output voice quality, so I think it’s unlikely to work well at all where the speakers are different people.
The part I expect I can help with is if you’re looking to use the speaker embedding tools to analyse the audio you’re planning to use for training a TTS model.
If that’s what you want, then I’d guide you towards recreating what’s shown in the video, either for multi-speaker audio or for a single speaker.
To do that you’d need the pretrained model from Released Models; it’s the one that @sanjaesc showed.
You’d want to go to the /speaker_encoder directory and adjust the paths in config.json so that the datasets path points at your audio and the output_path points at a suitable location (this is where you’ll be saving the embeddings, which are .npy files corresponding to each of your audio files). You can also pass these paths as parameters to compute_embeddings.py, so I’m not sure how strictly necessary it is to change them in config.json. To clarify, the config.json in that directory is currently the same as the config.json that comes with the model, but if they were to differ (for instance if you’d trained a speaker embedding model with different parameters) you’d want to use the one supplied with the model you’re using.
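Just to make that concrete, here’s a rough sketch of checking and editing those two entries with a short Python snippet. The key names (“datasets”, “output_path”) and the paths are assumptions on my part, so do verify them against the config.json that comes with the model you downloaded:

```python
# Rough sketch only: load the speaker_encoder config, check what keys it
# actually contains, and point the output path at a folder of your choosing.
# Key names and paths here are assumptions -- check your own config.json.
import json

CONFIG_PATH = "speaker_encoder/config.json"    # adjust to your checkout

with open(CONFIG_PATH) as f:
    config = json.load(f)

print(sorted(config.keys()))                   # see what's really in there

config["output_path"] = "/path/to/embeddings"  # where the .npy files will go
# config["datasets"] = [...]                   # point this at your audio

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=4)
```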
Then run compute_embeddings.py, passing the relevant arguments: the use_cuda parameter, the model you downloaded above (as per @sanjaesc), the config, and the paths. This works through your audio files and produces the .npy files I mentioned above.
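Once that has run, a quick way to confirm it worked is to load one of the .npy files and look at its shape. The paths below are placeholders, and the embedding dimension depends on the encoder config, so don’t take the exact layout as given:

```python
# Sanity-check the embeddings written by compute_embeddings.py.
# The embeddings folder is a placeholder; point it at your output_path.
import glob
import numpy as np

embed_files = glob.glob("/path/to/embeddings/**/*.npy", recursive=True)
print(f"found {len(embed_files)} embedding files")

emb = np.load(embed_files[0])
print(emb.shape, emb.dtype)   # expect one float vector per audio file
```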
Then go into the Notebook directory and launch the Jupyter notebook. It’ll need some basic edits to point the paths at your files, but these should be fairly obvious (e.g. MODEL_RUN_PATH and the others must point to somewhere on your computer). Depending on where you stored your embeddings, you may also need to edit the glob command for embed_files.
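To give an idea of the kind of glob change I mean, it’s something along these lines. The folder layout and the EMBED_PATH name are my placeholders, not the notebook’s exact code:

```python
# Two common layouts for the .npy files; use whichever matches your output_path.
import glob
import os

EMBED_PATH = "/path/to/embeddings"   # placeholder: wherever compute_embeddings.py wrote to

# Flat layout: all .npy files in one folder
embed_files = glob.glob(os.path.join(EMBED_PATH, "*.npy"))

# Nested layout (e.g. one sub-folder per speaker): search recursively instead
# embed_files = glob.glob(os.path.join(EMBED_PATH, "**", "*.npy"), recursive=True)
```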
Also, under Set up the embeddings, comment/uncomment the relevant part for single speaker or multi-speaker.
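I haven’t copied the notebook’s exact cells here, but conceptually that switch is just about how each embedding gets its speaker label. The folder-per-speaker assumption and the names below are mine, purely for illustration:

```python
# Conceptual sketch of the single- vs multi-speaker labelling, not the
# notebook's actual code. Assumes multi-speaker audio is organised with one
# sub-folder per speaker.
import glob
import os

embed_files = glob.glob("/path/to/embeddings/**/*.npy", recursive=True)

MULTI_SPEAKER = True   # set False for a single-speaker dataset

if MULTI_SPEAKER:
    # label each embedding with the name of the folder it sits in
    speaker_names = [os.path.basename(os.path.dirname(f)) for f in embed_files]
else:
    # one speaker: every embedding gets the same label
    speaker_names = ["my_speaker"] * len(embed_files)
```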
Finally, you should be in a position to step through the notebook.
If all goes smoothly you should see the Bokeh chart near the end of the notebook. Don’t forget to run the final cell, which starts a local server, so that the hyperlinks on each plotted embedding point to the corresponding audio file (this is how you can click on the various sections of the chart and hear what the corresponding audio sounds like, as shown in the video).
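As I understand it, that final cell essentially amounts to a small static file server rooted where your audio lives, roughly like the sketch below if you ever want the same effect outside the notebook. The port and AUDIO_ROOT are placeholders, and the notebook’s own cell may well differ:

```python
# Minimal stand-in for the notebook's final cell: serve the audio folder over
# HTTP so the Bokeh hyperlinks resolve. Port and AUDIO_ROOT are placeholders.
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

AUDIO_ROOT = "/path/to/your/audio"
handler = functools.partial(SimpleHTTPRequestHandler, directory=AUDIO_ROOT)

print("serving", AUDIO_ROOT, "on http://localhost:8000")
HTTPServer(("localhost", 8000), handler).serve_forever()
```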
So, that’s quite a few steps! I hope I’ve got it all right and been clear. Have a go, and if you run into any problems you can’t figure out, report back with as much useful detail as you can and I’ll do my best to assist.