So there are different TTS libraries out there, and I see they all use different spectrogram normalization methods in model training.
Right now in Mozilla TTS, what we do is the following:
1. apply preemphasis to the wav
2. compute the spectrogram by STFT
3. audio.amp_to_db() - convert amplitude to decibels with 20 * np.log10(np.maximum(min_level, x))
4. audio._normalize() - normalize the spectrogram into the range [-4, 4], assuming the minimum dB is -100
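For reference, here is a minimal sketch of that pipeline in NumPy/librosa. The parameter values (preemphasis coefficient, FFT size, the [-4, 4] range and the -100 dB floor) are illustrative placeholders, not the exact repo config:

```python
import numpy as np
import librosa
from scipy import signal

# Illustrative values, not the actual repo defaults
PREEMPHASIS = 0.97
MIN_LEVEL_DB = -100
MAX_NORM = 4  # target range [-4, 4]

def preemphasis(wav, coef=PREEMPHASIS):
    # y[t] = x[t] - coef * x[t-1]
    return signal.lfilter([1, -coef], [1], wav)

def amp_to_db(x, min_level_db=MIN_LEVEL_DB):
    min_level = 10 ** (min_level_db / 20)  # amplitude floor matching the dB floor
    return 20 * np.log10(np.maximum(min_level, x))

def normalize(spec_db, min_level_db=MIN_LEVEL_DB, max_norm=MAX_NORM):
    # map [min_level_db, 0] dB linearly onto [-max_norm, max_norm]
    spec = np.clip((spec_db - min_level_db) / -min_level_db, 0.0, 1.0)
    return (2.0 * spec - 1.0) * max_norm

wav, sr = librosa.load("sample.wav", sr=22050)
spec = np.abs(librosa.stft(preemphasis(wav), n_fft=1024, hop_length=256))
spec_norm = normalize(amp_to_db(spec))
```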
First of all, does anyone see any problem here?
The only obvious issue above is that the preemphasis operation is hard to invert when you do batch inference. There is no straightforward CUDA implementation of it, since the de-preemphasis operation has a temporal dependency in itself. You can approximate it with RNN layers, but that is slow. Hence, I guess it makes sense to drop preemphasis; keeping it also makes our model incompatible with the latest vocoder models.
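To make the temporal dependency concrete, the inverse of preemphasis is a first-order IIR recurrence where each output sample depends on the previous output sample. The explicit loop below is illustrative only, but it shows why the operation does not parallelize across time on a GPU:

```python
import numpy as np
from scipy import signal

def inv_preemphasis(x, coef=0.97):
    # Inverse of y[t] = x[t] - coef * x[t-1] is the recurrence
    #   y[t] = x[t] + coef * y[t-1]
    return signal.lfilter([1], [1, -coef], x)

def inv_preemphasis_loop(x, coef=0.97):
    # Same thing as an explicit loop: every output sample needs the
    # previous output sample, so the computation is inherently sequential.
    y = np.zeros_like(x)
    for t in range(len(x)):
        y[t] = x[t] + coef * (y[t - 1] if t > 0 else 0.0)
    return y
```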
I see that the NVIDIA Tacotron implementation does not use any normalization except the amp_to_db operation.
ESPNet, on the other hand, uses Standardization with mean and variance. It is good to compute the normalization parameters from the target dataset (as in image recognition) to keep the normalization flexible across different datasets. However, the downside is that every frequency bin gets the same level of consideration from the model. I don’t think that is the right thing to do, since for speech different frequency bands signify different aspects of the signal. Another downside is that in a multi-speaker model we need to compute mean-var stats separately per speaker, which is viable but an additional headache.
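For clarity, here is a rough sketch of what per-frequency-bin mean-var standardization could look like (the shapes and helper names are my own, not ESPNet's code):

```python
import numpy as np

def compute_meanvar_stats(mel_specs):
    # mel_specs: iterable of [n_mels, T] arrays from the target dataset
    frames = np.concatenate([m.T for m in mel_specs], axis=0)  # [total_frames, n_mels]
    mean = frames.mean(axis=0)        # per-bin mean
    std = frames.std(axis=0) + 1e-8   # per-bin std (eps for numerical safety)
    return mean, std

def standardize(mel, mean, std):
    # every bin becomes zero-mean / unit-variance, i.e. all bins carry
    # the same weight in the loss -- the concern mentioned above
    return (mel - mean[:, None]) / std[:, None]

def destandardize(mel_norm, mean, std):
    return mel_norm * std[:, None] + mean[:, None]

# For a multi-speaker model, the same stats would have to be computed
# per speaker by grouping mel_specs by speaker id before calling
# compute_meanvar_stats().
```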
Also, I saw that using Standardization enables better vocoder models, especially the new GAN-based ones.
I also started to experiment with Standardization and saw that the training seems more stable, but the Griffin-Lim (GL) based results sound worse.
So I guess the better option is to use Standardization with a trained vocoder and our current normalization flow for GL.
That is all I know, and I am kind of confused here. Please let me know if you have any take on this issue.
This isn’t something I know about particularly, but I did start reading around to see what others have discussed. It might help to visualise the impact of the different approaches on the spectrograms / audio waveforms before and after.
From what I understand, the peak normalization approach is used here. Perhaps a loudness normalization (e.g. RMS-based) would be more suitable, as the average level “seen by the algorithm” would be more constant?
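To illustrate the difference, here is a small sketch of peak vs. RMS normalization on a waveform (the target values are arbitrary examples):

```python
import numpy as np

def peak_normalize(wav, target_peak=0.95):
    # scale so the loudest sample hits target_peak
    return wav * (target_peak / max(np.abs(wav).max(), 1e-8))

def rms_normalize(wav, target_db=-24.0):
    # scale so the average energy (RMS) matches a target level, which
    # keeps the perceived loudness more consistent across clips
    rms = np.sqrt(np.mean(wav ** 2))
    target_rms = 10 ** (target_db / 20)
    return wav * (target_rms / max(rms, 1e-8))
```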
The problem with preemphasis is practical. Yes, it improves the results, but it is hard to apply at inference time since there is no GPU implementation of the inverse operation.
My latest experiments showed that mean-var normalization works a bit better than the other methods. Now I am adding it to our repo as another alternative. You can also still use preemphasis if you like; I am not dropping support for it.