I am quite new at this, and I have been reading tons of documentation, including the FAQ and wikis from https://github.com/mozilla/TTS
I was unsure whether to post this here or as an issue on the GitHub repository, but since these are more questions about how to do things than bug reports, this is probably the better place.
So far, just to gain experience with this, I am trying to train on the dataset in the folder "TTS\tests\data\ljspeech", using the "TTS\tests\inputs\test_train_config.json" file with Tacotron2, like this:
python3 TTS/bin/train_tacotron.py --config_path tests/inputs/test_train_config.json
Unfortunately I have no money to buy a powerful GPU for training, so my only choice is to use the CPU (an Intel Core i9-9900KF, not overclocked). It isn't as bad as I thought: each step takes about 10 seconds, plus 7 more for the evaluation. (I am unable to disable evaluation with the run_eval option, because the script then throws an error, which I think is related to using gradual training.) This is still quite acceptable to me (better than not training at all!). What seems strange is that, since I am not yet training a vocoder and am only using the CPU, I expected it to take much longer.
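For reference, this is the change I tried in order to disable evaluation, just flipping the run_eval field that already exists in test_train_config.json (I am assuming this is the intended switch; only the relevant field is shown):

```json
{
    "run_eval": false
}
```

With this set, the script throws the error I mentioned above, which is why I suspect it conflicts with gradual training.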
The thing I still haven't fully understood is this: we have to do two trainings, right? (I only want the same custom voice as in the wav files, not a new one.) One training for the TTS model, using Tacotron2 for example (I suppose it's the best one to choose), which converts text to spectrograms, and then another one for the vocoder, which converts those spectrograms to audio.
But I can't understand how to train vocoders other than the ones with scripts in TTS\TTS\bin.
Where are the others, like ParallelWaveGAN, Multi-Band MelGAN, Full-Band MelGAN and MelGAN? Or do those not need training and are meant only for inference/speech synthesis?
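My current guess (please correct me if I am wrong) is that these GAN-based vocoders all share the generic GAN vocoder training script already in TTS\bin, and that the specific architecture is selected in the vocoder config rather than by a per-vocoder script, with fields something like these (I am guessing at the exact names, so treat them as assumptions):

```json
{
    "generator_model": "multiband_melgan_generator",
    "discriminator_model": "melgan_multiscale_discriminator"
}
```

If that is right, then choosing between MelGAN, Multi-Band MelGAN and so on would just mean pointing the same training script at a different config.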
And another thing: when I tested speech synthesis with the TTS model "tts_models/en/ljspeech/tacotron2-DCA":
tts --text "Text for TTS"
and I checked the list in
I noticed that with the same TTS model the voice sounded quite different from vocoder to vocoder; it was almost like another woman's voice, which confused me. I thought the TTS models were supposed to keep the voice of the dataset they were built on, but each vocoder I tried made the voice sound so different. What I want from all this is the exact same voice as in the dataset, but now I am afraid of choosing a vocoder that changes the voice almost to another person's, and only finding that out after training for days (I am still not sure how many seconds each step will take during vocoder training).
Sorry for all this, but before last weekend I knew almost nothing about training TTS voices!