Trying to Understand the High-level Architecture

After about a day of poring over everything, I've managed to put together a custom LJSpeech-format dataset and am currently running CPU-based training with the default Tacotron config under /TTS/tts/configs/config.json. But reading through recent posts in the issue queue and here on the forums, I'm realizing there's a lot I don't yet understand.
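For context, my dataset follows the LJSpeech layout: a `metadata.csv` of pipe-separated entries next to a `wavs/` folder. Here's roughly how I sanity-checked it before training (the directory name below is just a placeholder for my own files):

```python
# Quick sanity check of an LJSpeech-style metadata.csv
# (DATASET_DIR is a placeholder for my own dataset root).
import csv
import os

DATASET_DIR = "my_dataset"
METADATA = os.path.join(DATASET_DIR, "metadata.csv")

with open(METADATA, encoding="utf-8") as f:
    # LJSpeech format: <wav id>|<raw text>|<normalized text>
    reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
    for wav_id, raw_text, norm_text in reader:
        wav_path = os.path.join(DATASET_DIR, "wavs", wav_id + ".wav")
        assert os.path.isfile(wav_path), f"missing audio: {wav_path}"
```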

For one, what's the difference between a TTS model and a vocoder? I understand that both need to be trained, but are they trained separately or together? In other words, after I finish this training run with the Tacotron config, do I need to train all over again if I want to use MelGAN or WaveGrad? What does that process look like?

Right now I’m running training off the master branch; are there changes in the dev branch that would make this process better/faster/etc?

Since I'm running this in WSL2, I don't have access to CUDA despite having an NVIDIA graphics card. Which component(s) need to talk to CUDA, and do you know of a way to make use of CUDA without an Insider build of Windows 10?
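For what it's worth, the quickest check I know of for whether training will actually see the GPU is straight from PyTorch (a generic check, nothing specific to this repo):

```python
import torch

# If this prints False under WSL2, training silently falls back to CPU;
# the GPU simply isn't visible to PyTorch at all.
print(torch.cuda.is_available())
print(torch.cuda.device_count())
```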

As I understand it, the TTS model takes text and generates an intermediate audio representation (visually, a mel spectrogram plus its underlying data).
The vocoder takes that representation and translates it into the actual, audible waveform.
Hence they are separate models with their own parameters.
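Conceptually the pipeline looks something like this; the function names and shapes below are purely illustrative, not the actual API of this repo:

```python
import numpy as np

def synthesize_mel(text: str, n_mels: int = 80) -> np.ndarray:
    """Stage 1 (TTS model, e.g. Tacotron 2): text -> mel spectrogram."""
    n_frames = 10 * len(text)  # dummy length, just to show the shape
    return np.zeros((n_mels, n_frames), dtype=np.float32)

def vocode(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    """Stage 2 (vocoder, e.g. MelGAN / WaveGrad): mel -> audible waveform."""
    return np.zeros(mel.shape[1] * hop_length, dtype=np.float32)

mel = synthesize_mel("Hello world.")  # (80, frames) -- what the TTS model is trained to predict
wav = vocode(mel)                     # (frames * hop_length,) raw audio samples
```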

I don't know for sure. I fiddled with Tacotron 2 separately and used the published vocoder for inference, and things came out okay-ish (given the amount of data I had and the number of steps I'd put it through).
I then fine-tuned the published vocoder with the same dataset (ground truth), and some sentences sounded better while others sounded slightly worse, but the voice did change to resemble the target voice.
So you can definitely train them separately; I'm just not sure what the optimal order is, or whether it's recommended to train them together. I asked once but never got an answer. There is a "recipe" section somewhere on GitHub, you may want to try that.
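In plain PyTorch terms, the fine-tuning I did amounts to loading the published checkpoint and then continuing training on my own data. Here's a generic sketch of that idea only; DummyVocoder and the checkpoint name are placeholders, not this repo's actual classes or training script:

```python
import torch
import torch.nn as nn

class DummyVocoder(nn.Module):
    """Stand-in for a real vocoder, just to illustrate the fine-tuning flow."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(80, 1, kernel_size=1)

    def forward(self, mel):
        return self.net(mel)

model = DummyVocoder()
torch.save({"model": model.state_dict()}, "published_checkpoint.pth")  # pretend this is the published model

# Fine-tuning = restore the published weights, then keep training on your own (ground-truth) mels.
state = torch.load("published_checkpoint.pth", map_location="cpu")
model.load_state_dict(state["model"])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # typically a lower LR than training from scratch

mel = torch.zeros(1, 80, 100)    # dummy ground-truth mel from "my" dataset
target = torch.zeros(1, 1, 100)  # dummy target
loss = nn.functional.l1_loss(model(mel), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```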

Yikes, good luck, hope you have a lot of time to spare.