About model training, vocoder training and dataset error handling

I understand that you train the TTS model first and then you train the vocoder.
They seem to be two stages that are both required for voice generation.
There is also some talk here about a generic vocoder, which suggests it does not rely on your dataset.

  1. When do you not need a new vocoder?
  2. For how many steps do you train your model before you train the vocoder based on it?
  3. What do more vocoder steps improve, what do more model training steps improve?
  4. How do you realize that your dataset is flawed, and how do you spot the flaws in it?

Please read the TTS wiki - a lot of your questions are addressed there.

Okay. 1-3 are not answered there.

1. You do not need a new vocoder if you find that the generic ones perform well enough, or if you do not need a vocoder at all and are fine with the lower quality of Griffin-Lim (see the sketch below).
2. The number of steps really depends on your preference, but as a rule of thumb, the spectrograms generated by TTS contain white noise until the reduction factor reaches 2. After roughly 50K further steps at r=2 the quality improves, since TTS then focuses on fine details.
3. Training the vocoder longer improves its output, just like any other neural network, but it reaches a plateau after which improvements are hard to notice (especially with GAN vocoders). That also depends on dataset quality and on whether TTS produces good-quality spectrograms in the first place.
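If you skip the neural vocoder, Griffin-Lim is the fallback: it iteratively estimates the phase from the magnitude spectrogram. Here is a minimal sketch of that path using librosa; the sample rate, FFT size, and hop length are illustrative assumptions, not the settings of any particular TTS config:

```python
# Hedged sketch: turning a mel spectrogram into audio without a neural vocoder.
# Assumes `mel_spec` is a (n_mels, frames) power mel spectrogram from the TTS model;
# the audio parameters below are placeholders, not taken from a real config.
import numpy as np
import librosa

sr, n_fft, hop_length = 22050, 1024, 256

mel_spec = np.abs(np.random.randn(80, 200))  # stand-in for a real model output

# Map mel bins back onto a linear-frequency magnitude spectrogram,
# then recover phase (and the waveform) with Griffin-Lim.
linear_spec = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=n_fft)
waveform = librosa.griffinlim(linear_spec, hop_length=hop_length, n_iter=60)
```

This is also the quality baseline you are comparing against when you decide a generic neural vocoder is "good enough": the vocoder's only job is to do this inversion better.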

Generally, a TTS model trained for about 500K steps coupled with a GAN vocoder trained for about 1M steps can give you good results. But it all depends on your training set and hyperparameter tuning (which you develop a knack for after studying the engine extensively).
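On question 4, one cheap check before training at all is to look for outliers in clip duration and speaking rate; clips that are unusually long, short, or fast often have trimming or transcript problems. A rough sketch, assuming an LJSpeech-style metadata.csv (file layout, column order, and thresholds are assumptions, not a prescription):

```python
# Hedged sketch: flag suspicious clips by duration and characters-per-second.
# Assumes an LJSpeech-style layout: metadata.csv with "id|raw text|normalized text"
# and audio under wavs/<id>.wav. Thresholds are rough guesses, not recommendations.
import csv
import soundfile as sf

rows = []
with open("metadata.csv", newline="", encoding="utf-8") as f:
    for clip_id, _, text in csv.reader(f, delimiter="|"):
        info = sf.info(f"wavs/{clip_id}.wav")
        rows.append((clip_id, text, info.duration))

for clip_id, text, duration in rows:
    chars_per_sec = len(text) / duration
    # Extreme durations or speaking rates usually mean silence padding,
    # clipped audio, or a transcript that does not match the recording.
    if duration < 1.0 or duration > 15.0 or not 5 <= chars_per_sec <= 30:
        print(f"check {clip_id}: {duration:.1f}s, {chars_per_sec:.1f} chars/s")
```

Anything this flags deserves a manual listen; the wiki's dataset notes cover the more thorough analysis.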
