Data and training considerations to improve voice naturalness

It is really smart. I've also implemented the same paper as that repo, with multi-speaker training in mind. If I find some time, I can release it under TTS.


Here’s the Jupyter Notebook for the Resemblyzer/Bokeh plot I mentioned above, in a gist: https://gist.github.com/nmstoker/0fe7b3e9beae6e608ee0626aef7f1800

You can ignore the substantial warning that comes from the UMAP reducer. Depending on the number of samples and the computer you use, it can take a while to run (so it may be worth running through with a more limited number of .wav files initially, just to be sure everything works, as sketched below). It takes 40+ minutes on my laptop.
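If you want that quick first pass, the simplest way is to slice the file list before computing anything. A minimal sketch (the dataset path and the cap of 50 files are assumptions, adjust to your setup):

from pathlib import Path

# gather all wav files, then keep only a small subset for a quick test run
wav_fpaths = sorted(Path("my_dataset/wavs").glob("*.wav"))  # path is an assumption
wav_fpaths = wav_fpaths[:50]  # start with ~50 files to confirm the pipeline end to end

Once that runs through cleanly, drop the slice and rerun on the full set.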

When it has produced the scatter plot, navigate to the location of your wav files, and in that location start a local server (with the same port as used in the notebook):

python3 -m http.server

and then you should be able to click on the points in the chart and it will open the corresponding audio in another tab. I've seen people write code to make a Bokeh chart play the audio directly, but I haven't tried that yet (and this basically works well enough).
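For anyone curious how the click-to-open part works: it is just Bokeh's TapTool with an OpenURL callback pointing at that local server. A rough sketch of the idea, assuming projection holds the 2-D UMAP coordinates and wav_fpaths the file paths (those names are from my notebook and may differ in yours); the port has to match the http.server you started:

from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, TapTool, OpenURL

output_notebook()

# x/y are the 2-D UMAP coordinates; each point carries the URL of its wav file
source = ColumnDataSource(data=dict(
    x=projection[:, 0],
    y=projection[:, 1],
    url=["http://localhost:8000/" + fp.name for fp in wav_fpaths],  # port must match http.server
))

fig = figure(tools="tap,pan,wheel_zoom,reset", title="Speaker embeddings (UMAP)")
fig.circle("x", "y", size=6, source=source)

# clicking a point opens the corresponding wav in a new browser tab
taptool = fig.select(type=TapTool)
taptool.callback = OpenURL(url="@url")

show(fig)

OpenURL just opens the tapped point's @url in a new tab, which is why serving the wav directory locally is enough.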

Here’s a screenshot of the scatter plot, with the two main clusters standing out quite clearly.


Would you be willing to adapt your notebook to https://github.com/mozilla/TTS/tree/dev/speaker_encoder? That'd be a great contribution! I already have a model trained on LibriSpeech with 900 speakers that I can share.

Yes, I’d be keen to give that a shot. I’ll have to look over the code there in a bit more detail and I’ll probably have a few questions.

Feel free to ask questions as you like :slight_smile:

Hey Neil, do you remember which part took the longest to run? I am trying to speed things up.

Roughly, it was looping over all the wav files with preprocess_wav(wav_fpath) that took the longest, but the next two steps also took a decent amount of time, though not quite as long, I think. I'll try to find some time to look at it this evening or tomorrow evening, so if I get updated timings I can share those.

from multiprocessing import Pool
from resemblyzer import preprocess_wav

# change the worker count according to the cores available
with Pool(32) as p:
    wavs = p.map(preprocess_wav, list(wav_fpaths))

Try this out; it should save you quite a bit of time.

P.S. You won't see the tqdm progress bar though.
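If you do want to keep a progress bar, one option is Pool.imap wrapped in tqdm; a minimal sketch, assuming wav_fpaths is a list:

from multiprocessing import Pool
from tqdm import tqdm
from resemblyzer import preprocess_wav

# imap yields results as each file finishes, so tqdm can track progress per file
with Pool(32) as pool:
    wavs = list(tqdm(pool.imap(preprocess_wav, wav_fpaths), total=len(wav_fpaths)))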

P.P.S.

This is pretty dope.


The model link popped up a request for access in Google Drive, so I submitted that a little earlier.

I plan to get the notebook working with your pre-trained model first (basically exactly as you have it, against the LibriTTS data I've got downloaded (somewhere!) on my desktop), then I'll see about using it with my own dataset to try to cluster it in a similar way to what I did in my gist with Bokeh.

should be fine now…

Thx! I’ll be waiting for it.


I've made a pull request with the interactive plotting changes I got working earlier.

It can handle either a single speaker or multiple speakers. I slightly adjusted compute_embeddings.py so it can also pull in my corpus (basically the same format as LJSpeech, although my CSV has only two columns, which it can now optionally read).
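Roughly, the optional-column handling amounts to something like this (a simplified sketch rather than the exact PR code; load_metadata is just an illustrative name, and the pipe delimiter follows LJSpeech's metadata.csv):

# LJSpeech-style metadata: pipe-delimited, file id first, then transcript column(s).
# A two-column corpus just falls back to the raw text for the normalised transcript.
def load_metadata(csv_path):
    items = []
    with open(csv_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            file_id, text = cols[0], cols[1]
            norm_text = cols[2] if len(cols) > 2 else text  # optional third column
            items.append((file_id, text, norm_text))
    return items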

It needs the speaker embedding model downloaded as per the Speaker Encodings README; then install Bokeh, produce the embeddings, update the file locations in the notebook, and you should be good to go.

There's a quick screencast of it here too (sorry - the voice-over is a little too quiet and then the audio samples played are much louder! Take care while listening :wink: Also, the cursor position seems off in the recording versus its actual location).


Good video and PR! If you like, you can also add the video link to the encoder README.md in your PR.

Sure, I'm away from my computer right now, but I'll do that this evening.

Thanks again for this project - I'm really pleased I could make a small contribution by way of thanks for the huge effort and skill you've put into it :slightly_smiling_face: