Clear process for generating a custom voice


Newbie here, so apologies if I’m missing the obvious. I am trying to achieve the following and think the steps are as follows, but have a few gaps. Please can someone help me fill in the gaps/answer the inline questions in Italics? And of course suggest any steps that may be missing. Maybe this can become a clear quick-start guide for this particular aim. Thank you.

Aim: To install Mozilla TTS on a Linux machine, and fine-tune a pre-trained LJSpeech model with a new voice of my own.

1) Install CUDA (this will allow the NVIDIA GPU to be used):
sudo apt-get install cuda

2) Install Mozilla TTS using the simple packaging method ("Using package") detailed here.

Check audio synthesis is working by running the server:
python3 -m TTS.server.server
Open http://localhost:5002 in a browser, enter some text, and check that a wav file is produced and plays OK.

Check CUDA is working with:
python3 -m TTS.train
Before the usage prompt you should see some info about CUDA:

Using CUDA: True
Number of GPUs: 1
If it’s working, Using CUDA should be True, and Number of GPUs should match what is installed in your machine.

3) Record a set of wav files in the LJSpeech format: 22050 Hz, 16-bit, mono WAV.
    Recommended duration: single sentences, 5–10 seconds each.
    Remove any silence at the beginning and end of each recording.
    Normalize the audio level.
    Minimum number of files/combined duration: Help please
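Before training, it might save some pain to verify every recording actually matches the LJSpeech format. Here is a small stdlib-only sketch I'd use for that check (the `wavs` directory name matches the structure in the next step; adjust to taste):

```python
import sys
import wave
from pathlib import Path

def check_wav(path):
    """Return a list of problems found in one wav file (empty list = OK)."""
    problems = []
    with wave.open(str(path), "rb") as w:
        if w.getframerate() != 22050:
            problems.append(f"sample rate {w.getframerate()} != 22050")
        if w.getsampwidth() != 2:  # sample width is in bytes; 2 bytes = 16-bit
            problems.append(f"{w.getsampwidth() * 8}-bit != 16-bit")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels != mono")
    return problems

if __name__ == "__main__":
    wav_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("wavs")
    if wav_dir.is_dir():
        for wav in sorted(wav_dir.glob("*.wav")):
            for problem in check_wav(wav):
                print(f"{wav.name}: {problem}")
```

Run it against your dataset's wavs folder; silence means everything conforms.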

4) Store the wavs in a dataset folder with the following structure:
|- yourchosenname
   |- metadata.csv
   |- wavs
      |- xxxx-0001.wav
      |- xxxx-0002.wav

5) Create metadata.csv inside your dataset folder with the following format:
    xxxx-0001|There were 50 people in the room.|There were fifty people in the room.
    xxxx-0002|Mr Jones is the friendly local butcher.|Mister Jones is the friendly local butcher.
    Use pipes to separate the 3 columns, where the 3rd column expands numbers, titles, etc.

Help please - is it necessary to split this into metadata_train.csv and metadata_val.csv? I saw suggestions that val should be about 10% of the size of train?
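In case a split does turn out to be needed, a 90/10 shuffle could be produced with a short script like this (a sketch; the output file names are my assumption):

```python
import os
import random

def split_metadata(lines, val_fraction=0.1, seed=42):
    """Shuffle metadata lines and return (train_lines, val_lines)."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # fixed seed so the split is reproducible
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]

if __name__ == "__main__" and os.path.exists("metadata.csv"):
    with open("metadata.csv", encoding="utf-8") as f:
        all_lines = [line for line in f if line.strip()]
    train, val = split_metadata(all_lines)
    with open("metadata_train.csv", "w", encoding="utf-8") as f:
        f.writelines(train)
    with open("metadata_val.csv", "w", encoding="utf-8") as f:
        f.writelines(val)
```

Shuffling before splitting matters: if the recordings were made in sessions, a tail-end split would give the validation set a different recording condition than training.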

6) Pre-process. Please help - I think there should be some kind of pre-processing stage here, but I don’t see any .py for doing this in the repo. Is it done automatically as part of the training? I’m guessing so, as there are params like do_trim_silence, etc.

7) Prepare a config.json for your new dataset.
    Please help - I am not sure what needs changing in here. I’m thinking the following:
    restore_path - set to the new dataset? But I’m not sure how to get the dataset into .pth.tar format?
    run_name - set to the new dataset name (although probably not mandatory?)
    run_description - describe the new dataset
    mel_fmin - set to ~50 for male, ~95 for female voices
    batch_size - 32 is standard. I understand there are issues with GPUs with smaller amounts of memory, and that you really need 16 GB. Are there any recommendations (e.g. drop this value) when trying to use a 4–8 GB GPU? Or are you just wasting your time?
    output_path - does this need changing?
    datasets - set the name and path to match your new dataset path?
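For reference, this is roughly the shape of the dataset-related part of config.json I would expect to edit. The key names are taken from my reading of the LJSpeech example config and may not be exact; the paths are obviously placeholders:

```json
{
    "run_name": "yourchosenname",
    "run_description": "fine-tuning on my own voice",
    "batch_size": 32,
    "output_path": "/path/to/save/checkpoints/",
    "audio": {
        "sample_rate": 22050,
        "mel_fmin": 50.0,
        "do_trim_silence": true
    },
    "datasets": [
        {
            "name": "ljspeech",
            "path": "/path/to/yourchosenname/",
            "meta_file_train": "metadata.csv",
            "meta_file_val": null
        }
    ]
}
```

My understanding is that "name" here selects the metadata formatter (LJSpeech pipe-separated format), not your dataset's own name, which is why it stays "ljspeech" even for a custom dataset - corrections welcome.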

8) Fine-tune the model:
    python3 -m TTS.train --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar

9) Run the server as in step 2 and test the new voice.

A couple of notes:

For step 1 it is way easier to use conda. You don’t have to deal with CUDA manually, which might be a problem.

You also need to find the right audio values for your dataset, like silence threshold, normalization, etc. Our data analysis notebook can help.

You should also remove noisy samples and do some pruning based on quality. Again, you can use the CheckSNR notebook for this to start.

Also, if you create your own dataset, you should perform phoneme coverage filtering to make your transcript set as representative as possible of the target language.
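A proper coverage check needs a phonemizer for the target language, but as a rough first proxy you can at least count symbol frequencies over the transcripts and flag rarely covered letters. A quick sketch (names are mine, not from the repo):

```python
from collections import Counter

def char_coverage(transcripts):
    """Count letter frequencies over all transcripts (case-folded).

    This is only a crude stand-in for phoneme coverage: letters, not
    phonemes, so use it just to spot obvious holes in the material.
    """
    counts = Counter()
    for text in transcripts:
        counts.update(c for c in text.lower() if c.isalpha())
    return counts

def rare_chars(transcripts, min_count=10):
    """Letters appearing fewer than min_count times - candidates for
    writing extra sentences that cover them."""
    counts = char_coverage(transcripts)
    return sorted(c for c, n in counts.items() if n < min_count)
```

Letters that never or rarely occur suggest the sentence set under-represents part of the language; a real pass would run the same counting over phonemized transcripts instead.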

BTW, if there is any volunteer to create a nice script or notebook to automate these steps to make life easier for beginners, we can work on that together.

I’ve been wanting to find the answer to this myself.

In my experience with Dutch, 4,000 recordings were not sufficient for transfer learning using Tacotron DDC with a near-default configuration.

For completeness: I started off from this model, which was trained on 15,000 fragments.

Can you share a couple of samples from your dataset so I can see the quality? Then I can comment better.

I appended some samples of the second dataset. Samples from the original dataset can be found on my dataset repo.