What are the TTS models you know to be faster than Tacotron?

I believe we’ve done almost everything practically possible on Tacotron. Mozilla TTS has the most robust public Tacotron implementation so far. However, it is still slightly slow for low-end devices.

It is time for us to go for a new model. I just want to ask your opinion about what model we should use for this next iteration. You can also share some papers if you like.


I think these two would be worth a look, as their non-autoregressive approach makes them parallelizable:

I guess the first model needs text-to-speech alignment information extracted beforehand. And, as with any Google TTS paper, they do not explain the real core of the model, which is the part that extracts linguistic features from the text. I'd guess it is relatively harder to implement and train for different languages.

But thanks for the second link. I didn't know that one.

What about FastSpeech 1 & 2 and DurIAN? They do need duration information for training, though.


The Springer guys claim to have a fast model, but as you know them, you probably know the repo as well :slight_smile:


Definitely don't forget about the LPCNet vocoder; there's a paper on how to make it faster:
[https://arxiv.org/pdf/2005.05551.pdf] FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction

@erogol Thanks for all your awesome work!


Multi-band MelGAN already provides very fast inference. I think the bottleneck is the TTS model, and that is what we need to update. But these are also good suggestions.

Yes, it seems that FeatherWave is the next best option.

Maybe this one for the text-to-feature part?

SpeedySpeech has an RTF of about 0.2 to 0.25 on my PC (4-core i5) without CUDA activated, which is impressive, and the generated audio is good in general. If you feed it longer sentences, it gets unstable towards the end and one can hardly understand what is being said. Another disadvantage is the ‘bad’ performance on ARM architecture, which I observed.
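For anyone who wants to reproduce numbers like these, here is a minimal sketch of how an RTF can be measured, assuming the usual definition (synthesis time divided by the duration of the generated audio). The `synthesize` callable and the dummy model are hypothetical placeholders, not part of SpeedySpeech or any other library; swap in your own inference call.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate):
    """Time one call to `synthesize` and return its RTF.

    `synthesize` is a hypothetical callable that returns a sequence of
    audio samples; it stands in for whatever model is being benchmarked.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# Dummy stand-in model (assumption): one second of silence at 22050 Hz.
def dummy_synthesize(text):
    return [0.0] * 22050


rtf = measure_rtf(dummy_synthesize, "hello world", sample_rate=22050)
print(f"RTF: {rtf:.4f} (below 1.0 means faster than realtime)")
```

In practice it is worth averaging over several sentences of different lengths, since a single short utterance is dominated by warm-up overhead.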

Thanks for that feedback. When you say CUDA deactivated, does that mean you perform inference on your i5 CPU, or on a GPU?

And yes, as someone said, it is the “year of the vocoder”: https://arxiv.org/abs/2007.15256

Exactly, no GPUs involved.

I don't know who is willing to invest time and resources into implementations like VocGAN without any demos and pretrained models to test. But you are right, 2020 is turning out to be a wild zoo. My guess: these papers were written by highly developed artificial intelligence :slight_smile:

There is a reference in the VocGAN paper. The demo is here: https://nc-ai.github.io/speech/publications/vocgan/

Thank you Dominik. I would be okay with each of their presented versions of MelGAN, ParallelWaveGAN or VocGAN if the performance is right.

The claimed results for this HiFi-GAN seem impressive:
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
paper, audio samples, source code, pretrained models

×13.44 realtime on CPU (MacBook Pro laptop, Intel i7 CPU 2.6GHz); they list MelGAN at ×6.59.

Seems like a better realtime factor than WaveGrad with RTF = 1.5 on an Intel Xeon CPU (16 cores, 2.3GHz).
Though more iterations (500k to 2500k) were used than in WaveGrad (6 to 1000).
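A quick sketch for putting the two kinds of figures on the same scale. It assumes that "×N realtime" means N seconds of audio per second of compute and that RTF means compute time divided by audio duration, so each is the reciprocal of the other. The function names are made up for illustration, and the figures come from different CPUs, so this is only a rough comparison.

```python
def speedup_to_rtf(speedup):
    """Convert a 'times faster than realtime' figure to an RTF."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def rtf_to_speedup(rtf):
    """Convert an RTF to a 'times faster than realtime' figure."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Figures quoted above, measured on different hardware.
print(round(speedup_to_rtf(13.44), 3))  # HiFi-GAN on CPU -> 0.074
print(round(speedup_to_rtf(6.59), 3))   # MelGAN on CPU   -> 0.152
print(round(rtf_to_speedup(1.5), 2))    # WaveGrad on CPU -> 0.67
```

Under these assumptions, an RTF of 1.5 is slower than realtime, while ×13.44 corresponds to an RTF of roughly 0.074.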

I have skimmed the paper; Table 1 shows that their versions v2 and v3 of VocGAN are significantly faster than MelGAN and have a high MOS. I will definitely try it in the upcoming days and report back.


I just started VocGAN training on the German dataset. My machine isn't the fastest for training, so it will take approximately 4-5 days until the recommended 300 epochs are reached…


The HiFi-GAN results sound very interesting. I think I will try a run later this week. Right now I am training PWGAN on GT alignments.

Sorry, I could not sleep. Nevertheless, I am looking forward to the results of the German jury.

v1 runs at about realtime on CPU only; v2 and v3 are significantly faster, about 3-4 times.

Stay tuned.
