Are we now headed towards the brightest future in training efficiency?

Here we are: training Tacotron 2 has become a swift breeze… In reality, leaving aside that it is the route to the sweet spot in quality and training curves, it hurts to say that "swift breeze" is a bit far from the truth.

At this point there is clearly no need to restate the importance of recent developments in TTS. It is fascinating to observe the daily progress.

While I'm still familiarizing myself, here is what I am wondering:

#1: Is LJSpeech with transfer learning still the most efficient route to high-grade results?
#2: How important is gradual training in speeding up convergence?

Is there an interesting point or consideration about the current state of things that is missing from this discussion?


In all of my experiments, Taco2 has been much more stable and reliable than Taco; I have never been able to train Taco successfully. That said, the dataset is extremely important, and I have concluded that if it is good quality, you can get away with fewer hours of recording.

#1: It helps, but it is not the holy grail. Do not forget that LJSpeech is a hard dataset, i.e. there is a lot of background noise in it and the sibilant sounds are very high in frequency. The models that Eren has trained are another story and yes, these are extremely helpful, especially for English TTS.
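On the transfer-learning point, the usual trick when fine-tuning from an LJSpeech checkpoint is to restore only the weights whose name and shape match the new model (the phoneme embedding or output layers often differ between runs). A framework-free sketch of that filtering step - the layer names and shapes below are made up for illustration; with PyTorch you would compare `tensor.shape` instead:

```python
# Hypothetical example: state dicts are represented here as {layer_name: shape}
# so the sketch stays framework-free.

def filter_matching(pretrained, target):
    """Return the subset of pretrained weights that can be copied into the
    target model: same layer name and same shape. Everything else is
    re-initialised and trained from scratch."""
    return {name: shape for name, shape in pretrained.items()
            if target.get(name) == shape}

# An LJSpeech-pretrained model vs. a new model with a smaller
# phoneme-embedding table (all shapes are illustrative):
pretrained = {"embedding.weight": (130, 512),
              "encoder.conv1.weight": (512, 512, 5)}
target = {"embedding.weight": (60, 512),
          "encoder.conv1.weight": (512, 512, 5)}

usable = filter_matching(pretrained, target)
# Only the encoder conv layer transfers; the embedding is re-initialised.
```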

#2: Gradual training always helps me.
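For context on #2: gradual training is typically configured as a schedule of (start_step, r, batch_size) triples, so the reduction factor r shrinks as training progresses. A minimal sketch of how such a schedule is looked up - the specific numbers below are illustrative, not recommendations:

```python
def current_phase(step, schedule):
    """Return the (r, batch_size) pair for the latest schedule entry
    whose start_step has been reached. Assumes the schedule is sorted
    by ascending start_step and begins at step 0."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, phase_r, phase_bs in schedule:
        if step >= start_step:
            r, batch_size = phase_r, phase_bs
    return r, batch_size

# Illustrative schedule: start coarse (r=7, many frames per decoder step)
# and refine towards r=2 as the attention stabilises.
schedule = [[0, 7, 64], [10_000, 5, 64], [50_000, 3, 32], [130_000, 2, 32]]
```

The point of the coarse-to-fine r is that large reduction factors make alignment easy to learn early on, after which smaller r improves quality.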


Thank you for the pointers, George. I see where you’re coming from; yes, it’s clear that his work is really impressive. Choosing a path is generally a very hard tradeoff when quality is a key concern. It does feel like I might be overlooking the holy grail for English TTS here :slight_smile:

Given the number of variables, I think it’s important that we try to share findings to help build up a general sense of how they interact and which settings are generally good or less good. At the same time, we should take care not to rule out some choices completely just because they’ve been less effective with certain datasets - I suspect that what works for one dataset won’t always be best for another. Empirically testing as far as one can is safest (although clearly there are practical limits there).

Adding yet more challenge is that what works well at one point in the repo history isn’t necessarily the case later: I’ve spent quite a bit of time recently trying to find the best settings for my own dataset with a really recent commit from the dev branch, and so far many of the variations seem to be giving me worse outcomes than I had with a commit from February. When I have the new vocoder trained I’ll be able to get an overall picture (right now I’m having to rely on results with GL alone).

I’ll write up my various runs soon, but I was trying out the new DDC feature. I think I still need to explore the normalisation a bit more, as I’m now getting a lot more distortion in output audio samples, along with messages about the audio being clipped during training - this has happened both with DDC enabled and with it turned off. I have seen these messages before, but they’re much more prevalent currently.
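On those clipping messages: a quick sanity check is to measure what fraction of samples in a waveform fall outside the valid [-1, 1] range before they get hard-clipped - a high fraction usually points at the normalisation settings rather than the model. A small framework-free sketch (the toy waveform is just for illustration):

```python
def clipping_fraction(samples, limit=1.0):
    """Fraction of samples whose magnitude exceeds the valid range,
    i.e. samples that would be hard-clipped on write-out."""
    clipped = sum(1 for s in samples if abs(s) > limit)
    return clipped / len(samples)

# A toy waveform pushed past full scale, e.g. by an overly hot gain:
waveform = [0.2, 0.9, -1.4, 1.1, -0.5, 0.0, 1.6, -0.3]
frac = clipping_fraction(waveform)  # 3 of 8 samples exceed the range
```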

One other thing I was looking into is whether the model benefits from reducing the phoneme alphabet to just those phonemes that appear in the language being trained. English uses around 40-50 phonemes, so the extra ones seem superfluous. My intuition was that cutting the phoneme alphabet would help, but my initial results suggest it makes very little measurable impact (a tiny reduction in the model parameter count, making it about 0.1% smaller, plus a small reduction in training time). Assuming this holds up when I test it further, my suspicion is that the model quickly learns to ignore the unused phonemes.
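To put that ~0.1% figure in perspective: shrinking the phoneme alphabet only shrinks the embedding table, which is tiny next to the rest of the network. A back-of-the-envelope calculation - the embedding dimension and total parameter count below are assumptions, roughly in the ballpark of a Tacotron 2 sized model:

```python
full_alphabet = 130        # assumed full phoneme symbol set
reduced_alphabet = 50      # roughly the English phoneme inventory
embedding_dim = 512        # assumed embedding size
total_params = 28_000_000  # assumed total model parameter count

# Only the embedding rows for the dropped symbols disappear.
saved = (full_alphabet - reduced_alphabet) * embedding_dim
relative = saved / total_params
print(f"saved {saved:,} params ({relative:.2%} of the model)")
```

Under these assumptions the saving is on the order of tens of thousands of parameters out of tens of millions, which lines up with the "barely measurable" observation.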


Thank you, Neil, for sharing these perspectives. It’s a great point to direct attention to the case-by-case nature of such a broad question rather than a one-size-fits-all practical answer :slight_smile: The runs on your current project sound very interesting and would make an insightful reference point. Where will the write-up be accessible? Thanks