I meant a small penalty term proportional to the intensity of each pixel, to force the model to produce only the necessary sounds, but it won't work since you say there is no noise in the training set.
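To make the idea concrete, here is a minimal sketch of the kind of penalty I had in mind. It assumes the model outputs a non-negative magnitude spectrogram (a list of rows of pixel values) and that the base loss is a plain mean squared error; both are assumptions for illustration, as are the name `intensity_penalized_loss` and the `weight` parameter, which do not come from the actual training code.

```python
def intensity_penalized_loss(pred, target, weight=1e-3):
    """Mean squared error plus a small penalty on predicted pixel intensity.

    Hypothetical helper: `pred` and `target` are 2-D spectrograms given as
    lists of rows, assumed to hold non-negative magnitudes.
    """
    # Flatten both spectrograms into flat lists of pixel values.
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    if not p or len(p) != len(t):
        raise ValueError("pred and target must be non-empty and the same size")

    # Reconstruction term: how far the prediction is from the target.
    mse = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)

    # Penalty term: mean absolute intensity of the predicted pixels,
    # so any energy the model emits costs a little.
    penalty = sum(abs(a) for a in p) / len(p)

    return mse + weight * penalty


# The penalty only adds cost where the model emits energy.
silent = intensity_penalized_loss([[0.0, 0.0]], [[0.0, 0.0]], weight=0.5)
noisy = intensity_penalized_loss([[0.0, 2.0]], [[0.0, 0.0]], weight=0.5)
```

If the training targets are already clean, the reconstruction term alone pushes the output toward silence wherever the target is silent, so the extra term would mostly just shrink everything slightly.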
Maybe it's an issue of robustness to different recording conditions and sound preprocessing. Have you tried with an unseen LibriTTS speaker?