I understand that you train the TTS model first and then train the vocoder.
They seem to be two stages that are both required for voice generation.
There is also some talk here about a generic vocoder, which suggests the vocoder does not necessarily depend on your dataset.
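As I understand it, the pipeline looks roughly like this (a minimal sketch; all names, frame counts, and dimensions below are illustrative placeholders, not a real library API):

```python
# Hypothetical two-stage TTS pipeline. Stage 1 (the "TTS model") maps text
# to a mel spectrogram; stage 2 (the vocoder) maps that spectrogram to audio.

def acoustic_model(text):
    """Stage 1: text -> mel spectrogram.
    Dummy output: ~10 frames per character, 80 mel bins per frame
    (80 bins is a common choice, but it is an assumption here)."""
    n_frames = 10 * len(text)
    return [[0.0] * 80 for _ in range(n_frames)]

def vocoder(mel, hop_length=256):
    """Stage 2: mel spectrogram -> waveform samples.
    The vocoder is conditioned only on spectrograms, never on the text,
    which may be why a generic pretrained vocoder can be reused."""
    return [0.0] * (len(mel) * hop_length)

mel = acoustic_model("hello")
audio = vocoder(mel)
print(len(mel), len(audio))  # 50 frames -> 12800 samples
```

If that picture is right, the vocoder only ever sees spectrograms, so its dependence on the training dataset is indirect, which leads to my questions: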
- When do you not need to train a new vocoder?
- For how many steps do you train your model before training the vocoder on its output?
- What do more vocoder training steps improve, and what do more model training steps improve?
- How do you realize that your dataset is flawed, and how do you spot the flaw in it?