So there are different TTS libraries out there, and I see they all use different spectrogram normalization methods in model training.
Right now in Mozilla TTS, what we do is the following:
- apply preemphasis to the wav
- compute the spectrogram with STFT
- `audio.amp_to_db()` - convert amplitude to decibel with `20 * np.log10(np.maximum(min_level, x))`
- `audio._normalize()` - normalize the spectrogram into the range [-4, 4], assuming the minimum dB is -100
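To make that concrete, here is a minimal numpy/librosa sketch of the flow. The constants and function names are illustrative assumptions on my part (typical defaults), not the exact Mozilla TTS code:

```python
import numpy as np
import librosa

# Assumed defaults, for illustration only.
MIN_LEVEL_DB = -100   # dB floor mentioned above
MAX_NORM = 4          # target range [-4, 4]
PREEMPHASIS = 0.97    # typical coefficient; my assumption

def apply_preemphasis(x, coef=PREEMPHASIS):
    # y[t] = x[t] - coef * x[t-1]
    return np.append(x[0], x[1:] - coef * x[:-1])

def amp_to_db(x):
    min_level = 10 ** (MIN_LEVEL_DB / 20)  # amplitude at -100 dB
    return 20 * np.log10(np.maximum(min_level, x))

def normalize(S_db):
    # map [MIN_LEVEL_DB, 0] dB onto [-MAX_NORM, MAX_NORM]
    S = np.clip((S_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0, 1)
    return 2 * MAX_NORM * S - MAX_NORM

wav, sr = librosa.load("sample.wav", sr=22050)
D = np.abs(librosa.stft(apply_preemphasis(wav), n_fft=1024, hop_length=256))
S = normalize(amp_to_db(D))  # model target, values in [-4, 4]
```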
First of all, does anyone see any problem here?
The only obvious problem above is that the preemphasis operation is hard to invert in batch inference. There is no straightforward CUDA implementation of it, since the de-preemphasis operation has a temporal dependency in itself. You can approximate it with RNN layers, but that is slow. Hence, I guess it makes sense to drop preemphasis. Keeping it also makes our model incompatible with the latest vocoder models.
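To illustrate the dependency, both directions can be written as one-pole filters with scipy (coefficient 0.97 assumed):

```python
from scipy.signal import lfilter

def preemphasis(x, coef=0.97):
    # FIR: y[t] = x[t] - coef * x[t-1] -- each output needs only input
    # samples, so it parallelizes trivially.
    return lfilter([1.0, -coef], [1.0], x)

def inv_preemphasis(y, coef=0.97):
    # IIR: x[t] = y[t] + coef * x[t-1] -- each output needs the *previous
    # output*, which forces sequential computation and blocks a simple
    # CUDA implementation.
    return lfilter([1.0], [1.0, -coef], y)
```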
I see that the NVIDIA Tacotron implementation does not use any normalization except the amp_to_db operation.
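As far as I can tell, that amounts to plain log compression of the magnitudes with a floor and nothing else; roughly this (numpy rendering, and the clip value is my assumption):

```python
import numpy as np

def dynamic_range_compression(x, clip_val=1e-5):
    # log-compress magnitudes with a floor; no dB scaling, no range mapping
    return np.log(np.maximum(x, clip_val))
```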
ESPNet, on the other hand, uses Standardization with mean and variance. It is good to compute the normalization parameters from the target dataset (as in image recognition) to keep the normalization flexible across different datasets. However, the downside is that every frequency bin receives the same level of consideration by the model. I don't think that is the right thing to do, since in speech different frequency levels signify different aspects of the signal. Another downside is that in a multi-speaker model we need to compute mean-var stats separately per speaker, which is viable but an additional headache.
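For reference, this is roughly what the Standardization variant looks like, with per-frequency-bin stats computed over the whole dataset (function names are mine, not ESPNet's):

```python
import numpy as np

def compute_stats(spectrograms):
    # spectrograms: list of (n_freq, n_frames) arrays from the target dataset
    frames = np.concatenate(spectrograms, axis=1)
    mean = frames.mean(axis=1, keepdims=True)  # per-frequency-bin mean
    std = frames.std(axis=1, keepdims=True)    # per-frequency-bin std
    return mean, std

def standardize(S, mean, std):
    # every bin becomes zero-mean/unit-variance, i.e. the model weighs
    # all frequency bins equally -- the concern raised above
    return (S - mean) / (std + 1e-8)

def destandardize(S_norm, mean, std):
    return S_norm * (std + 1e-8) + mean
```

For a multi-speaker model, `compute_stats` would have to run once per speaker, which is the extra headache mentioned above.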
Also, I saw that using Standardization enables better vocoder models, especially the new GAN-based ones.
I also started to experiment with Standardization and saw that training seems more stable, but the GL (Griffin-Lim) based results sound worse.
So I guess the better option is to use Standardization with a trained vocoder, and our current normalization flow for GL.
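For GL, that means inverting the flow from the first sketch before running Griffin-Lim, along these lines (the power exponent 1.5 is a common sharpening trick, my assumption rather than the exact current code):

```python
import numpy as np
import librosa

def denormalize(S_norm, min_level_db=-100, max_norm=4):
    # invert [-4, 4] back to [min_level_db, 0] dB
    S = np.clip(S_norm, -max_norm, max_norm)
    return (S + max_norm) / (2 * max_norm) * -min_level_db + min_level_db

def db_to_amp(S_db):
    return np.power(10.0, S_db / 20.0)

# S_norm: model output in [-4, 4]
S = db_to_amp(denormalize(S_norm)) ** 1.5  # mild power to reduce GL artifacts
wav = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)
# if preemphasis was applied at training time, inv_preemphasis(wav) goes here
```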
This is all I know, and I am kind of confused here. Please let me know if you have any take on this issue.