I need help understanding something.
I removed the linear-spectrogram term from the loss function, along with the postnet that generates it. I didn’t need the linear spectrogram for my vocoder, and removing it saves a LOT of GPU memory during training. However, the reduced model doesn’t produce reasonable attention even after 50K steps, whereas the full model’s attention was reasonable after only a few thousand steps.
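To make the change concrete, here is a minimal sketch of the two loss variants I mean (function names are mine, and I’m assuming L1 losses on both outputs, as in the original Tacotron setup; adjust to match your codebase):

```python
import torch
import torch.nn.functional as F

def full_loss(mel_pred, mel_target, linear_pred, linear_target):
    # Full model: mel loss (decoder output) plus linear loss (postnet output)
    mel_loss = F.l1_loss(mel_pred, mel_target)
    linear_loss = F.l1_loss(linear_pred, linear_target)
    return mel_loss + linear_loss

def reduced_loss(mel_pred, mel_target):
    # Reduced model: postnet and linear branch removed, mel loss only
    return F.l1_loss(mel_pred, mel_target)
```

The reduced version is the one that fails to learn attention for me.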
Why isn’t the mel-spectrogram part of the loss enough to train the attention?