What are the TTS models you know to be faster than Tacotron?

I believe we’ve done almost everything practically possible on Tacotron. Mozilla TTS has the most robust public Tacotron implementation so far. However, it is still slightly slow for low-end devices.

It is time for us to go for a new model. I just want to ask your opinion about what model we should use for this next iteration. You can also share some papers if you like.


I think these two would be worth a look, as their non-autoregressive approach makes them parallelizable:
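To make the non-autoregressive point concrete, here is a toy numpy sketch (purely illustrative shapes and weights, not any real model): an autoregressive decoder must loop because frame t depends on frame t-1, while a parallel decoder can emit all frames in one batched op.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 100, 80  # hypothetical: 100 decoder steps, 80-dim mel frames
W = rng.standard_normal((D, D)) * 0.01

def autoregressive_decode(T):
    """Each frame depends on the previous one -> strictly sequential."""
    frames = [np.zeros(D)]
    for _ in range(T):
        frames.append(np.tanh(W @ frames[-1]))  # step t needs step t-1
    return np.stack(frames[1:])

def parallel_decode(T):
    """Frames computed independently from encoder outputs -> one matmul."""
    encoder_out = rng.standard_normal((T, D))
    return np.tanh(encoder_out @ W.T)  # all T frames at once

print(autoregressive_decode(T).shape)  # (100, 80)
print(parallel_decode(T).shape)        # (100, 80)
```

The sequential loop is what makes Tacotron-style decoding slow on weak hardware; the second function is the shape of the speedup these papers are after.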


I guess the first model needs text-to-voice alignment information to be extracted beforehand. And, as with any Google TTS paper, they do not explain the real core of the model, which is the part that extracts linguistic features from the text. I'd guess it is relatively harder to implement and train for different languages.

But thanks for the second link. I didn't know about that one.


What about FastSpeech 1 & 2 and DurIAN? Though they need duration info for training.
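The "duration info" these models need feeds a length regulator: each phoneme encoding is repeated for as many mel frames as its (predicted or extracted) duration. A minimal sketch of that expansion step, with made-up toy values:

```python
import numpy as np

def length_regulator(phoneme_feats, durations):
    """Repeat each phoneme encoding by its duration in frames."""
    return np.repeat(phoneme_feats, durations, axis=0)

feats = np.arange(6).reshape(3, 2)   # 3 phonemes, 2-dim toy encodings
durs = np.array([2, 1, 3])           # frames per phoneme
out = length_regulator(feats, durs)
print(out.shape)  # (6, 2) -> total frames = sum(durations)
```

Once durations are known, decoding needs no autoregressive loop, which is exactly why these models are fast; the cost is extracting those durations (e.g. from a teacher model or forced alignment) at training time.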


The Springer guys claim to have a fast model, but since you know them, you probably know the repo as well :slight_smile:


Definitely don't forget about the LPCNet vocoder; there's a paper on how to make it faster:
[FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction](https://arxiv.org/pdf/2005.05551.pdf)
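The "LP" in LPCNet is plain linear prediction: a cheap filter estimates each sample from the previous ones, so the network only has to model the small residual. A toy numpy sketch of that prediction step on a sinusoid (hypothetical 2nd-order coefficients, not LPCNet's actual analysis):

```python
import numpy as np

def lpc_predict(signal, coeffs):
    """Estimate s[t] from the previous len(coeffs) samples."""
    order = len(coeffs)
    pred = np.zeros_like(signal)
    for t in range(order, len(signal)):
        # newest sample first: coeffs[0]*s[t-1] + coeffs[1]*s[t-2] + ...
        pred[t] = coeffs @ signal[t - order:t][::-1]
    return pred

t = np.arange(1000) / 1000
s = np.sin(2 * np.pi * 5 * t)
# a sinusoid satisfies s[t] = 2cos(w)s[t-1] - s[t-2] exactly
w = 2 * np.pi * 5 / 1000
pred = lpc_predict(s, np.array([2 * np.cos(w), -1.0]))
residual = s[2:] - pred[2:]
print(np.max(np.abs(residual)))  # near zero: LPC captures most of the signal
```

Because the residual is much simpler than the raw waveform, the neural part can stay tiny, which is where LPCNet's speed comes from.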

@erogol Thanks for all your awesome work!


Multi-band MelGAN already provides very fast inference. I think the bottleneck is the TTS model, which is what we need to update. But these are also good suggestions.
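One way to back up the "TTS model is the bottleneck" claim is to time the two stages separately. A rough sketch with stand-in functions (the shapes and models here are hypothetical placeholders, not Mozilla TTS calls):

```python
import time
import numpy as np

def profile(fn, *args, runs=10):
    """Rough wall-clock average over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs

# stand-ins for the real pipeline stages
def tts_model(text_ids):   # text -> mel spectrogram
    return np.random.standard_normal((len(text_ids) * 5, 80))

def vocoder(mel):          # mel -> waveform
    return np.random.standard_normal(mel.shape[0] * 256)

mel = tts_model(np.arange(50))
print(f"TTS:     {profile(tts_model, np.arange(50)) * 1e3:.3f} ms/run")
print(f"Vocoder: {profile(vocoder, mel) * 1e3:.3f} ms/run")
```

Swapping the real models into the stand-ins would show per-stage latency on the target low-end device and make the comparison concrete.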