Data and training considerations to improve voice naturalness

It is really smart. I've also implemented the same paper as that repo, with multi-speaker training in mind. If I find some time, I can release it under TTS.


Here’s the Jupyter Notebook for the Resemblyzer/Bokeh plot I mentioned above, in a gist: https://gist.github.com/nmstoker/0fe7b3e9beae6e608ee0626aef7f1800

You can ignore the substantial warning that comes from the UMAP reducer. Depending on the number of samples and the computer you use, it can take a while to run (so it may be worth running through with a more limited number of .wav files initially, just to be sure everything works, as sketched below). It takes 40+ minutes on my laptop.
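If you want that quick first pass, the simplest way is to slice the file list before computing anything. A minimal sketch (the dataset path and the cap of 50 files are assumptions, adjust to your setup):

from pathlib import Path

# gather all wav files, then keep only a small subset for a quick test run
wav_fpaths = sorted(Path("my_dataset/wavs").glob("*.wav"))  # path is an assumption
wav_fpaths = wav_fpaths[:50]  # start with ~50 files to confirm the pipeline end to end

Once that runs through cleanly, drop the slice and rerun on the full set.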

When it has produced the scatter plot, navigate to the location of your wav files, and in that location start a local server (with the same port as used in the notebook):

python3 -m http.server

and then you should be able to click on the points in the chart and it will open the corresponding audio in another tab. I've seen people write code to make a Bokeh chart play the audio directly, but I haven't tried that yet (and this basically works well enough).
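For anyone curious how the click-to-open part works: it is just Bokeh's TapTool with an OpenURL callback pointing at that local server. A rough sketch of the idea, assuming projection holds the 2-D UMAP coordinates and wav_fpaths the file paths (those names are from my notebook and may differ in yours); the port has to match the http.server you started:

from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, TapTool, OpenURL

output_notebook()

# x/y are the 2-D UMAP coordinates; each point carries the URL of its wav file
source = ColumnDataSource(data=dict(
    x=projection[:, 0],
    y=projection[:, 1],
    url=["http://localhost:8000/" + fp.name for fp in wav_fpaths],  # port must match http.server
))

fig = figure(tools="tap,pan,wheel_zoom,reset", title="Speaker embeddings (UMAP)")
fig.circle("x", "y", size=6, source=source)

# clicking a point opens the corresponding wav in a new browser tab
taptool = fig.select(type=TapTool)
taptool.callback = OpenURL(url="@url")

show(fig)

OpenURL just opens the tapped point's @url in a new tab, which is why serving the wav directory locally is enough.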

Here’s a screenshot of the scatter plot, with the two main clusters standing out quite clearly.


Would you be willing to adapt your notebook to https://github.com/mozilla/TTS/tree/dev/speaker_encoder? That'd be a great contribution! I already have a model trained on LibriSpeech with 900 speakers that I can share.

Yes, I’d be keen to give that a shot. I’ll have to look over the code there in a bit more detail and I’ll probably have a few questions.

Feel free to ask questions as you like :slight_smile:

Hey Neil, do you remember which part took the longest to run? I am trying to speed things up.

Roughly, it was looping over all the wav files with preprocess_wav(wav_fpath) that took the longest, but the next two steps also took a decent amount of time, though not quite as long, I think. I'll try to find some time to look at it this evening or tomorrow evening, so if I get updated timings I can share those.

from multiprocessing import Pool
from resemblyzer import preprocess_wav

# change the worker count according to the cores available
with Pool(32) as p:
    wavs = p.map(preprocess_wav, list(wav_fpaths))

Try this out; it should save you quite a bit of time.

P.S. You won't see the tqdm progress bar though.
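If you do want to keep a progress bar, one option is Pool.imap wrapped in tqdm; a minimal sketch, assuming wav_fpaths is a list:

from multiprocessing import Pool
from tqdm import tqdm
from resemblyzer import preprocess_wav

# imap yields results as each file finishes, so tqdm can track progress per file
with Pool(32) as pool:
    wavs = list(tqdm(pool.imap(preprocess_wav, wav_fpaths), total=len(wav_fpaths)))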

P.P.S.

This is pretty dope.


The model link popped up a request for access in Google Drive, so I submitted that a little earlier.

I plan to get the notebook working with your pre-trained model first (basically exactly as you have it, against the LibriTTS data I've got downloaded (somewhere!) on my desktop), then I'll see about using it with my own dataset to try to cluster it in a similar way to what I did in my gist with Bokeh.

should be fine now…

Thx! I’ll be waiting for it.


I've made a pull request with the interactive plotting changes I got working earlier.

It can handle either a single speaker or multiple speakers. I slightly adjusted compute_embeddings.py so it can also pull in my corpus (basically the same format as LJSpeech, although my CSV has only two columns, which it can now optionally read).
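Roughly, the optional-column handling amounts to something like this (a simplified sketch rather than the exact PR code; load_metadata is just an illustrative name, and the pipe delimiter follows LJSpeech's metadata.csv):

# LJSpeech-style metadata: pipe-delimited, file id first, then transcript column(s).
# A two-column corpus just falls back to the raw text for the normalised transcript.
def load_metadata(csv_path):
    items = []
    with open(csv_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            file_id, text = cols[0], cols[1]
            norm_text = cols[2] if len(cols) > 2 else text  # optional third column
            items.append((file_id, text, norm_text))
    return items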

It needs the speaker embedding model downloaded as per the Speaker Encodings README; then install Bokeh, produce the embeddings, update the file locations in the notebook, and you should be good to go.

There's a quick screencast of it here too (sorry - the voice-over is a little too quiet and then the audio samples played are much louder! Take care while listening :wink: Also, the cursor position seems off in the recording versus its actual location).


Good video and PR! If you like, you can also add the video link to the encoder README.md in your PR.

Sure, I'm away from my computer right now, but I'll do that this evening.

Thanks again for this project - I'm really pleased I could make a small contribution by way of thanks for the huge effort and skill you've put into it :slightly_smiling_face: