Does anyone have a reasonable intuition about the normalization method used for spectrograms in TTS training?

erogol · March 9, 2020, 3:07pm

There are different TTS libraries out there, and I see they all use different methods to normalize spectrograms for model training.

Right now, in Mozilla TTS, we do the following:

apply preemphasis to wav
compute the spectrogram by stft
audio.amp_to_db() - convert amplitude to decibel with 20 * np.log10(np.maximum(min_level, x))
audio._normalize() - normalize the spectrogram into the range [-4,4] assuming minimum db is -100
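
As a rough illustration of the last two steps, here is a minimal NumPy sketch. It assumes a symmetric linear mapping from [-100, 0] dB onto [-4, 4] and leaves out the reference-level subtraction; the function names and defaults are illustrative, not the exact Mozilla TTS code.

```python
import numpy as np

MIN_LEVEL_DB = -100.0  # floor stated in the post
MAX_NORM = 4.0         # target range is [-MAX_NORM, MAX_NORM]


def amp_to_db(x, min_level_db=MIN_LEVEL_DB):
    # Floor the amplitude so log10 never sees zero; 10 ** (-100 / 20) = 1e-5.
    min_level = 10.0 ** (min_level_db / 20.0)
    return 20.0 * np.log10(np.maximum(min_level, x))


def normalize(s_db, min_level_db=MIN_LEVEL_DB, max_norm=MAX_NORM):
    # Assumed mapping: [min_level_db, 0] dB -> [0, 1] -> [-max_norm, max_norm].
    unit = (s_db - min_level_db) / -min_level_db
    return np.clip(2.0 * max_norm * unit - max_norm, -max_norm, max_norm)


def denormalize(s_norm, min_level_db=MIN_LEVEL_DB, max_norm=MAX_NORM):
    # Inverse of normalize(), valid for values that were not clipped.
    unit = (np.clip(s_norm, -max_norm, max_norm) + max_norm) / (2.0 * max_norm)
    return unit * -min_level_db + min_level_db


magnitudes = np.array([0.0, 10.0 ** -2.5, 1.0])  # silence, -50 dB, 0 dB
print(normalize(amp_to_db(magnitudes)))
```

With these assumptions, silence lands at the bottom of the range, a -50 dB magnitude at about 0, and a 0 dB magnitude at the top.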

github.com/mozilla/TTS/blob/master/utils/audio.py#L151


def apply_preemphasis(self, x):
    if self.preemphasis == 0:
        raise RuntimeError(" [!] Preemphasis is set 0.0.")
    return scipy.signal.lfilter([1, -self.preemphasis], [1], x)


def apply_inv_preemphasis(self, x):
    if self.preemphasis == 0:
        raise RuntimeError(" [!] Preemphasis is set 0.0.")
    return scipy.signal.lfilter([1], [1, -self.preemphasis], x)


def spectrogram(self, y):
    if self.preemphasis != 0:
        D = self._stft(self.apply_preemphasis(y))
    else:
        D = self._stft(y)
    S = self._amp_to_db(np.abs(D)) - self.ref_level_db
    return self._normalize(S)


def melspectrogram(self, y):
    if self.preemphasis != 0:
        D = self._stft(self.apply_preemphasis(y))

First of all, does anyone see any problem here?

The only obvious problem above is that preemphasis is hard to undo if you do batch inference. There is no straightforward CUDA implementation, since de-emphasis has a temporal dependency: each output sample depends on the previous output sample. You can approximate it with RNN layers, but that is slow. Hence, I guess it makes sense to drop preemphasis. Keeping it also makes our model incompatible with the latest vocoder models.
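
To make the temporal dependency concrete, here is a small NumPy sketch of both directions. It mirrors the two lfilter calls quoted above with zero initial state; the coefficient 0.97 is an assumed, commonly used value, not one taken from this thread.

```python
import numpy as np


def preemphasis(x, coef=0.97):
    # FIR filter: y[n] = x[n] - coef * x[n-1].
    # Every output depends only on inputs, so this vectorizes trivially.
    x = np.asarray(x, dtype=np.float64)
    y = x.copy()
    y[1:] = x[1:] - coef * x[:-1]
    return y


def inv_preemphasis(y, coef=0.97):
    # IIR filter: x[n] = y[n] + coef * x[n-1].
    # Every output depends on the previous output, hence the sequential loop.
    y = np.asarray(y, dtype=np.float64)
    x = np.empty_like(y)
    prev = 0.0
    for n in range(len(y)):
        prev = y[n] + coef * prev
        x[n] = prev
    return x


rng = np.random.default_rng(0)
wav = rng.standard_normal(1000)
restored = inv_preemphasis(preemphasis(wav))
print(np.allclose(restored, wav))
```

The forward pass is a single shifted subtraction, while the inverse has to walk the signal sample by sample, which is the part that does not map directly onto a batched GPU op.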

I see NVIDIA Tacotron implementation does not use any normalization except the amp_to_db operation
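
For comparison with the dB scale used above, here is a hedged NumPy re-implementation of that log compression (the original uses torch.log and torch.clamp). The two scales differ only by the constant factor 20 / ln(10), about 8.686, and a clip value of 1e-5 corresponds to a floor of -100 dB. The helper names ln_to_db and db_to_ln are made up for this sketch.

```python
import numpy as np


def log_compress(x, C=1.0, clip_val=1e-5):
    # NumPy stand-in for the torch-based dynamic_range_compression.
    return np.log(np.clip(x, clip_val, None) * C)


def ln_to_db(s_ln):
    # 20 * log10(x) == (20 / ln(10)) * ln(x)
    return s_ln * 20.0 / np.log(10.0)


def db_to_ln(s_db):
    return s_db * np.log(10.0) / 20.0


magnitudes = np.array([0.0, 1e-3, 1.0])
print(ln_to_db(log_compress(magnitudes)))
```

So a model trained on one scale sees the same shape of data as on the other, just stretched by a constant, as long as the floors match.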

github.com/NVIDIA/tacotron2/blob/master/audio_processing.py#L78


    angles = angles.astype(np.float32)
    angles = torch.autograd.Variable(torch.from_numpy(angles))
    signal = stft_fn.inverse(magnitudes, angles).squeeze(1)


    for i in range(n_iters):
        _, angles = stft_fn.transform(signal)
        signal = stft_fn.inverse(magnitudes, angles).squeeze(1)
    return signal




def dynamic_range_compression(x, C=1, clip_val=1e-5):
    """
    PARAMS
    ------
    C: compression factor
    """
    return torch.log(torch.clamp(x, min=clip_val) * C)




def dynamic_range_decompression(x, C=1):
    """

ESPNet, on the other hand, uses standardization with the mean and variance. Computing the normalization parameters from the target dataset (as in image recognition) is good because it makes the normalization flexible across different datasets. However, the downside is that each frequency level gets the same level of consideration from the model. I don’t think that is the right thing to do, since for speech different frequency levels signify different aspects of the speech. Another downside is that in a multi-speaker model we need to compute mean-var stats separately per speaker, which is viable but an additional headache.
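
A minimal sketch of mean-var standardization, assuming the statistics are computed per frequency bin over all frames of a dataset and kept separately per speaker. The data here is random and the names are illustrative; this is not ESPNet's actual code.

```python
import numpy as np


def compute_stats(specs, eps=1e-8):
    # specs: list of arrays shaped [n_bins, n_frames]; stats are per bin.
    frames = np.concatenate(specs, axis=1)
    mean = frames.mean(axis=1, keepdims=True)
    std = np.maximum(frames.std(axis=1, keepdims=True), eps)
    return mean, std


def standardize(spec, mean, std):
    return (spec - mean) / std


def destandardize(spec, mean, std):
    return spec * std + mean


# Synthetic stand-ins for log-mel spectrograms of two speakers.
rng = np.random.default_rng(0)
specs_by_speaker = {
    "speaker_a": [rng.normal(-40.0, 10.0, size=(80, n)) for n in (50, 70)],
    "speaker_b": [rng.normal(-55.0, 5.0, size=(80, n)) for n in (60, 40)],
}
stats_by_speaker = {
    spk: compute_stats(specs) for spk, specs in specs_by_speaker.items()
}

mean_a, std_a = stats_by_speaker["speaker_a"]
normed = standardize(specs_by_speaker["speaker_a"][0], mean_a, std_a)
print(normed.shape)
```

Because every bin is rescaled to zero mean and unit variance, quiet high-frequency bins end up with the same spread as loud low-frequency ones, which is the "same level of consideration" effect described above.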

Also, I saw that using standardization enables better vocoder models, especially with the new GAN-based models.

I also started to experiment with standardization and saw that training seems more stable, but the Griffin-Lim (GL) based results sound worse.

So I guess the better option is to use Standardization with a trained vocoder and our current normalization flow for GL.

That is all I know, and I am kind of confused here. Please let me know if you have any take on this issue.

nmstoker · March 15, 2020, 4:32pm

This isn’t something I know about particularly but I did start reading around to see what others discussed etc. It might help to visualise the impact of different approaches on the spectrograms / audio waves before & after.

Regarding dropping the pre-emphasis there were mentions in the context of speech recognition that it possibly wasn’t necessary any longer: https://www.quora.com/Why-is-pre-emphasis-i-e-passing-the-speech-signal-through-a-first-order-high-pass-filter-required-in-speech-processing-and-how-does-it-work
I’d be inclined to test without it, and it sounds like others must not be doing it either, given what you say about other vocoders.

dkreutz · March 15, 2020, 8:10pm

Yes, pre-emphasis can be skipped.

Regarding normalization see https://en.wikipedia.org/wiki/Audio_normalization

From what I understand, the peak normalization approach is used here. A loudness normalization (e.g. RMS) would probably be more suitable, as the average level “seen by the algorithm” is more constant?
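
For reference, the two waveform-level approaches from that article can be sketched as below. The target values are arbitrary assumptions, and this operates on the raw audio rather than on the spectrogram.

```python
import numpy as np


def peak_normalize(wav, target_peak=1.0):
    # Scale so the largest absolute sample equals target_peak.
    peak = np.max(np.abs(wav))
    return wav if peak == 0 else wav * (target_peak / peak)


def rms_normalize(wav, target_rms=0.1):
    # Scale so the root-mean-square level equals target_rms.
    rms = np.sqrt(np.mean(np.square(wav)))
    return wav if rms == 0 else wav * (target_rms / rms)


t = np.linspace(0.0, 1.0, 16000, endpoint=False)
wav = 0.3 * np.sin(2.0 * np.pi * 220.0 * t)
print(np.max(np.abs(peak_normalize(wav))))
print(np.sqrt(np.mean(np.square(rms_normalize(wav)))))
```

Note that RMS normalization can push peaks above full scale on clips with a few loud transients, so it may need a limiter or a clip step afterwards.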

erogol · March 17, 2020, 11:28am

The problem with preemphasis is practical. Yes, it improves the results, but it is hard to apply at inference time since there is no GPU implementation.

My latest experiments showed that mean-var normalization works a bit better than the other methods. I am now adding it to our repo as another alternative. You can still use preemphasis if you like; I am not dropping support for it.