How to train my own TTS model with more vocoders

I am quite new at this, and I have been reading tons of documentation, including the FAQ and wikis from https://github.com/mozilla/TTS.

I was unsure whether to post this here or as an issue on the GitHub repository, but since these are more questions about how to do things, it's probably better here.

So far, just to gain experience, I am trying to train on the dataset in the folder \TTS\tests\data\ljspeech, using the TTS\tests\inputs\test_train_config.json file with Tacotron2, like this:

python3 TTS/bin/train_tacotron.py --config_path tests/inputs/test_train_config.json

Unfortunately I have no money to buy a powerful GPU for training, so my only choice is to use the CPU (an Intel Core i9-9900KF, not overclocked), which isn't as bad as I thought: it's taking about 10 seconds per step, plus 7 for the evaluation (I am unable to disable the evaluation with run_eval, because the script throws an error which I think is related to using gradual training). This is still quite acceptable to me (better than not training at all!), but it's strange: since I am not yet training a vocoder, and I'm only using the CPU, I expected it to be much slower.
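For reference, disabling the evaluation should just be a matter of flipping this flag in tests/inputs/test_train_config.json (assuming run_eval sits at the top level of that config, which is how I read it; this is the change that throws the error for me):

"run_eval": false,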

The issue here, which I still haven't fully understood, is that we have to do two trainings, right? (I only want the same custom voice from the WAV files, not new ones.) One for the TTS model, using Tacotron2 for example (I suppose it's the best one to choose), and then another one for the vocoder.
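If I understand it correctly, the two stages would look something like this (the config file names here are just placeholders I made up, and I'm picking the GAN script from the list below as an example):

python3 TTS/bin/train_tacotron.py --config_path my_tacotron2_config.json
python3 TTS/bin/train_vocoder_gan.py --config_path my_vocoder_config.json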

But I can't understand how to train other vocoders than the ones in TTS\TTS\bin:

train_vocoder_gan.py
train_vocoder_wavegrad.py
train_vocoder_wavernn.py

Where are the other ones, like ParallelWaveGAN, Multi-Band MelGAN, Full-Band MelGAN and MelGAN? Unless those don't need training and are meant only for inference/speech synthesis?

And another thing: when I tested speech synthesis with the TTS model tts_models/en/ljspeech/tacotron2-DCA:

tts --text "Text for TTS" \
    --model_name "<type>/<language>/<dataset>/<model_name>" \
    --vocoder_name "<type>/<language>/<dataset>/<model_name>" \
    --out_path folder/to/save/output/

and I checked the list in

tts --list_models

vocoder_models/universal/libri-tts/wavegrad
vocoder_models/universal/libri-tts/fullband-melgan
vocoder_models/en/ljspeech/mulitband-melgan
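For example, this is the kind of combination I was trying (model and vocoder names copied from the lists above):

tts --text "Text for TTS" \
    --model_name "tts_models/en/ljspeech/tacotron2-DCA" \
    --vocoder_name "vocoder_models/universal/libri-tts/wavegrad" \
    --out_path folder/to/save/output/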

I noticed that for the same TTS model the voice sounded quite different from vocoder to vocoder; it was almost like another woman's voice, which confused me. I thought the TTS models were supposed to keep the same voice as the datasets they were built on, but each vocoder I tried made the voice sound so different. What I want from all this is to get exactly the same voice as in the dataset, but now I am afraid of choosing a vocoder that changes the voice almost into another person's, and only finding that out after training for days (I am still not sure how many seconds each step will take when training the vocoders).

Sorry for all this, but before last weekend I knew almost nothing about training TTS voices!


Happy to see that you landed in the right place eventually.

  1. Without GPUs it is unfortunately very time-consuming to train models. I suggest you at least use Google Colab to begin with, which provides some GPUs for limited usage.

  2. All *GAN vocoders are trained with train_vocoder_gan.py. You need to specify which one in the config.json file (see the sketch after this list). Check some of the example config files.

  3. Not all vocoders are compatible with all the TTS models. This is the reason for the difference between vocoders. You need to use the compatible ones: either a vocoder trained on the same language and dataset, or one of the universal vocoders.
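For point 2, a minimal sketch of what selecting a vocoder in the config looks like; the exact key names below are my recollection of the example configs (e.g. the Multi-Band MelGAN one), so double-check them against the actual files in the repo:

"generator_model": "multiband_melgan_generator",
"discriminator_model": "melgan_multiscale_discriminator",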

Hope this helps!

Note: Please use some formatting in your post next time for code and commands.

Without GPUs it is unfortunately very time-consuming to train models. I suggest you at least use Google Colab to begin with, which provides some GPUs for limited usage.

Unfortunately I probably won't have another choice if I can't set up GPU training, because that's the problem: "limited usage"… I have never used Google Colab before, and unfortunately I have a lot of custom datasets I would like to train on (even though they are all small/medium sized, not as big as the LJSpeech dataset), so I doubt I can make heavy use of their GPUs, but I will have to investigate that more.

But wait a minute… I DO have an NVIDIA card, a GeForce GTX 1060 SUPER 6GB. The problem is that I found it SUCH a hassle to install the NVIDIA toolkit, and on top of that I found that it's not possible to use it without registering for the Windows Insider program and installing Insider Windows builds, which I really don't want to do. I am creating a separate topic for this, so reply either here or there if you can advise on that. I presume using this card is better than the CPU, right? At least I have seen people do thousands of training steps even with this card, so it should be doable despite its low VRAM and lack of tensor cores.
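By the way, once the toolkit is set up, there is a quick way to check whether PyTorch actually sees the card (this is plain PyTorch, nothing TTS-specific):

python3 -c "import torch; print(torch.cuda.is_available())"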

All *GAN vocoders are trained with train_vocoder_gan.py. You need to specify which one in the config.json file. Check some of the example config files.

Not all vocoders are compatible with all the TTS models. This is the reason for the difference between vocoders. You need to use the compatible ones: either a vocoder trained on the same language and dataset, or one of the universal vocoders.

OK, I will look into that later, after I solve the problem of using the GPU I currently have.