After about a day of poring over everything, I’ve been successful at getting a custom dataset together in LJSpeech format and am currently running CPU-based training using the default Tacotron config under `/TTS/tts/configs/config.json`. But reading through recent posts in the issue queue and here in the forums, I’m realizing that there’s a lot I don’t yet understand.
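For context, this is roughly how I’m kicking off training from the repo root. The exact script name under `TTS/bin/` may differ depending on the checkout, so treat this as a sketch of my setup rather than the canonical invocation:

```bash
# Roughly my invocation from the repo root; the training script name under
# TTS/bin/ may vary between branches/releases.
python TTS/bin/train_tacotron.py --config_path TTS/tts/configs/config.json
```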
For one, what’s the difference between a TTS model and a vocoder? I understand that both need to be trained – are they trained separately, or together? In other words, after I’m done running this training with the Tacotron config, do I need to train all over again if I want to use MelGAN or WaveGrad? What does that process look like?
Right now I’m running training off the `master` branch; are there changes in the `dev` branch that would make this process better/faster/etc.?
Since I’m running this in WSL2, I don’t have access to CUDA despite having an NVIDIA graphics card. Which component(s) need to talk to CUDA, and do you know of a way to make use of CUDA without an Insider build of Windows 10?
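For reference, here’s the quick check I’m running from inside WSL2 to confirm the GPU isn’t visible (in case I’m just missing something obvious):

```bash
# Quick sanity checks from inside WSL2; on my setup the GPU isn't exposed,
# so PyTorch reports no available CUDA device.
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"  # prints False for me
```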