Since training a vocoder takes time and compute, I’d like to train and contribute a universal vocoder that works for most use cases.
I have compute but I’m no expert on TTS, and I’d like help choosing hyperparameters and tuning the config file.
I know this is possible since @erogol did it with WaveRNN and it worked very well.
I’d like to do the same but with faster inference, using either MelGAN or PWGAN trained on the same LibriTTS dataset, to cover more use cases.
As I understand it, the sample rate of the dataset used to train Tacotron shouldn’t really matter, because it shouldn’t affect the mel spectrogram (I’m not so sure about that); the only parameters that should affect it are:
- num_mels (80)
- mel_fmin (50)
- mel_fmax (8000, to cover as wide a range of speakers as possible)
- spec_gain (20, but I suspect a gain mismatch between TTS and vocoder should be fixable without retraining)
- fft_size (1024)
- win_length (1024)
- hop_length (256)
So, again as I understand it, these are the parameters that must be shared by all models that use the same vocoder.
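To make that concrete, the shared parameters could live in one fragment that both the TTS and the vocoder configs are validated against. This is only a sketch in Python with hypothetical key names (loosely modelled on Mozilla TTS config files), not the actual config schema:

```python
# Hypothetical shared audio section -- key names are an assumption,
# check them against your actual TTS/vocoder config files.
shared_audio = {
    "num_mels": 80,
    "mel_fmin": 50,       # Hz
    "mel_fmax": 8000,     # Hz; must stay at or below sample_rate / 2 (Nyquist)
    "spec_gain": 20,
    "fft_size": 1024,
    "win_length": 1024,   # must not exceed fft_size
    "hop_length": 256,
}

def check_audio_config(cfg, sample_rate):
    """Sanity-check that a TTS model and a vocoder can be paired
    at the given audio sample rate."""
    assert cfg["win_length"] <= cfg["fft_size"], "window larger than FFT size"
    assert cfg["mel_fmax"] <= sample_rate / 2, "mel_fmax above Nyquist frequency"
    return True

check_audio_config(shared_audio, 24000)  # LibriTTS native rate
```

Note that mel_fmax = 8000 is safe at both 24 kHz and 16 kHz, since it stays below either Nyquist frequency.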
The vocoder’s output sample rate shouldn’t matter too much, but I think generating at 16 kHz instead of LibriTTS’s native 24 kHz should give roughly a 33% boost in inference performance. (I’m not so sure about that either, since PWGAN and MelGAN are far more parallelisable than WaveRNN.)
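The 33% figure is just arithmetic on samples per second; a quick sanity check of that number (the arithmetic only, not a benchmark):

```python
# For an autoregressive vocoder like WaveRNN, cost scales roughly with
# the number of output samples, so the saving equals the sample-rate ratio.
sr_libritts = 24000  # LibriTTS native sample rate, Hz
sr_target = 16000    # proposed lower output rate, Hz

samples_saved = 1 - sr_target / sr_libritts
print(f"samples per second saved: {samples_saved:.1%}")  # -> 33.3%

# For parallel vocoders (MelGAN / PWGAN) the wall-clock gain is likely
# smaller than this, since they emit many samples per forward pass.
```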
What do you think about that? Is it a good idea? Am I off in my understanding of the TTS process?