Oh ok, I get why the sample rate of the TTS training data matters: win_length is expressed in samples, so a 1024-sample window at 16 kHz does not cover the same amount of time as one at 24 kHz. So I was wrong, the input sample rate matters.
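Just to make the point concrete, here's a quick sanity check of how much time the same 1024-sample window covers at each sample rate (the numbers are simple arithmetic, nothing library-specific):

```python
# The same win_length in samples covers different durations
# depending on the sample rate of the audio it is applied to.
win_length = 1024

dur_16k_ms = win_length / 16000 * 1000  # window duration at 16 kHz, in ms
dur_24k_ms = win_length / 24000 * 1000  # window duration at 24 kHz, in ms

print(f"16 kHz: {dur_16k_ms:.1f} ms")   # 64.0 ms
print(f"24 kHz: {dur_24k_ms:.2f} ms")   # 42.67 ms
```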
What I don’t get is why the output sample rate of the vocoder matters. Instead of focusing on a variable output sample rate, shouldn’t we focus on a variable input sample rate? (Since different sample rates produce different mel spectrograms.)
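To illustrate why the input side differs: with the hop_length fixed in samples (256 here is just a typical value I'm assuming), the frame rate of the mel spectrogram changes with the input sample rate, so the same second of audio produces a different number of frames:

```python
# With hop_length fixed in samples, the mel frame rate depends on
# the input sample rate: same audio duration, different frame counts.
hop_length = 256  # samples; a common default, used here for illustration

def frames_per_second(sr: int, hop: int) -> float:
    """Approximate number of STFT/mel frames per second of audio."""
    return sr / hop

print(frames_per_second(16000, hop_length))  # 62.5 frames/s
print(frames_per_second(24000, hop_length))  # 93.75 frames/s
```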
We could, for example, make the first few layers sample-rate specific and switch between them during training.
Or we could train a small model to map 16 kHz mel spectrograms to 24 kHz mel spectrograms (something like a U-Net).
Am I missing something @erogol?
Thanks!
Edit: not a U-Net, since the size of the spectrogram varies with the duration of the utterance (unless it’s windowed).
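For what it's worth, a purely convolutional mapper without a fixed-size bottleneck doesn't care about the time dimension, so variable-duration spectrograms are fine. A minimal NumPy sketch (the mel count, kernel, and shapes here are made up for illustration):

```python
import numpy as np

def conv1d_same(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Convolve each mel band of x (shape: n_mels x T) along time,
    with 'same' padding so the output length equals the input length."""
    pad = len(kernel) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="edge")
    out = np.empty(x.shape, dtype=float)
    for t in range(x.shape[1]):
        out[:, t] = xp[:, t:t + len(kernel)] @ kernel
    return out

k = np.array([0.25, 0.5, 0.25])  # arbitrary smoothing kernel

spec_short = np.random.rand(80, 50)   # 80 mel bands, 50 frames
spec_long = np.random.rand(80, 200)   # same op applies to any length

assert conv1d_same(spec_short, k).shape == (80, 50)
assert conv1d_same(spec_long, k).shape == (80, 200)
```

The output time length always matches the input, which is why fully convolutional architectures sidestep the variable-duration problem that rules out fixed-input-size designs.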