Hi,
Newbie here, so apologies if I'm missing the obvious. I am trying to achieve the aim below and think the steps are roughly as follows, but I have a few gaps. Could someone help me fill in the gaps, answer the inline questions (the "Help please" / "Please help" lines), and of course suggest any steps that may be missing? Maybe this can become a clear quick-start guide for this particular aim. Thank you.
Aim: To install Mozilla TTS on a Linux machine and fine-tune a pre-trained LJSpeech model with a new voice of my own.
Steps:
1) Install CUDA (this will allow the NVIDIA GPU to be used):
sudo apt-get install cuda
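As a quick sanity check that the driver can see the card (my own suggestion, not from the TTS docs), this should list your GPU:
nvidia-smi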
2) Install Mozilla TTS using the Simple packaging method detailed here:
https://github.com/mozilla/TTS/wiki/Released-Models
Using package:
https://github.com/reuben/TTS/releases/download/ljspeech-fwd-attn-pwgan/TTS-0.0.1+92aea2a-py3-none-any.whl
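I'm assuming the wheel is installed with pip, i.e. something like:
pip3 install https://github.com/reuben/TTS/releases/download/ljspeech-fwd-attn-pwgan/TTS-0.0.1+92aea2a-py3-none-any.whl
(Please correct me if the wiki describes a different install command.)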
Check audio synthesis is working by running the server:
python3 -m TTS.server.server
Open a web page at http://localhost:5002, enter some text, and check that a wav file is produced and plays OK.
Check CUDA is working by running:
python3 -m TTS.train
Before the usage prompt you should see some info about CUDA:
Using CUDA: True
Number of GPUs: 1
If it's working, Using CUDA should be True and Number of GPUs should match what is installed in your machine.
3) Record a set of wav files in the LJSpeech format - 22050 Hz, 16-bit, mono WAV (see the conversion example after this list if yours need converting).
Recommended duration: single sentences of 5-10 seconds each.
Remove any silence at the beginning and end of each recording.
Normalize the audio level.
Minimum number of files / combined duration: Help please.
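If recordings need converting to that format, ffmpeg can do it, e.g. (filenames are just placeholders):
ffmpeg -i raw_recording.wav -ar 22050 -ac 1 -c:a pcm_s16le wavs/xxxx-0001.wav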
4) Store the wavs in a dataset folder with the following structure:
|- yourchosenname
   |- metadata.csv
   |- wavs
      |- xxxx-0001.wav
      |- xxxx-0002.wav
5) Create metadata.csv inside your dataset folder with the following format:
xxxx-0001|There were 50 people in the room.|There were fifty people in the room.
xxxx-0002|Mr Jones is the friendly local butcher.|Mister Jones is the friendly local butcher.
Pipes separate the 3 columns.
The 3rd column is the normalized text, with numbers, titles, etc. expanded.
Help please - is it necessary to split this into metadata_train.csv and metadata_val.csv? I have seen suggestions that val should be about 10% of the size of train.
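If a split is needed, I was planning to carve off roughly 10% for validation with something like this (just my own sketch using standard shell tools, not from the TTS docs):
shuf metadata.csv -o metadata_shuffled.csv
n_val=$(( $(wc -l < metadata_shuffled.csv) / 10 ))
head -n "$n_val" metadata_shuffled.csv > metadata_val.csv
tail -n +"$(( n_val + 1 ))" metadata_shuffled.csv > metadata_train.csv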
6) Pre-process: Please help - I think there should be some kind of pre-processing stage here, but I don't see any .py for doing this in this repo. Is it done automatically as part of the training? I'm guessing so, as there are params like do_trim_silence.
7) Prepare a config.json for your new dataset.
Please help - I am not sure what needs changing in here. I'm thinking the following:
restore_path - set to the new dataset? But I'm not sure how to get the dataset into .pth.tar format?
run_name - set to new dataset name (although probably not mandatory?)
run_description - describe the new dataset
mel_fmin - set to ~50 for a male voice, ~95 for a female voice
batch_size - 32 is standard. I understand there are issues with GPUs that have smaller amounts of memory, and that you really need 16 GB. Are there any recommendations (e.g. drop this value) if trying to use a 4-8 GB GPU? Or are you just wasting your time?
output_path - does this need changing?
datasets - set the name and path to match your new dataset path?
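For the datasets entry I was going to try something along these lines, copying the field names from the stock LJSpeech config (please correct me if the keys differ in this release):
"datasets": [
  {
    "name": "ljspeech",
    "path": "/path/to/yourchosenname/",
    "meta_file_train": "metadata_train.csv",
    "meta_file_val": "metadata_val.csv"
  }
]
(I kept "name": "ljspeech" on the assumption that it selects the metadata format/preprocessor rather than a specific dataset.)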
8) Fine-tune the model:
python3 -m TTS.train --config_path TTS/tts/configs/config.json --restore_path /path/to/your/model.pth.tar
9) Run up the server as in step 2 and test the new voice.