What are the TTS models you know to be faster than Tacotron?

I believe we’ve done almost everything practically possible on Tacotron. Mozilla TTS has the most robust public Tacotron implementation so far. However, it is still slightly slow for low-end devices.

It is time for us to go for a new model. I just want to ask your opinion about what model we should use for this next iteration. You can also share some papers if you like.


I think these two would be worth a look, as their non-autoregressive approach makes them parallelizable:


I guess the first model needs text-to-speech alignment information to be extracted beforehand. And, as with any Google TTS paper, they don't explain the real crux of the model, which is the part that extracts linguistic features from the text. I'd guess it is relatively harder to implement and train for different languages.

But thanks for the second link, I didn't know about that one.


What about FastSpeech 1 & 2 and DurIAN? Though they need duration info for training.
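For anyone unfamiliar with why these models need duration info: they replace the attention alignment with a "length regulator" that repeats each phoneme encoding for its predicted number of frames. A minimal NumPy sketch (function name and shapes are my own, not from any of these repos):

```python
import numpy as np

def length_regulator(phoneme_encodings, durations):
    """Expand phoneme-level encodings to frame level by repeating
    each phoneme's vector for its duration (in frames).

    phoneme_encodings: (num_phonemes, hidden_dim) array
    durations: per-phoneme frame counts
    Returns a (sum(durations), hidden_dim) frame-level sequence.
    """
    return np.repeat(phoneme_encodings, durations, axis=0)

# Toy example: 3 phonemes, hidden_dim 2, durations 2, 3 and 1 frames
enc = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
frames = length_regulator(enc, [2, 3, 1])
print(frames.shape)  # (6, 2)
```

Because the whole frame sequence is produced in one shot like this (no step-by-step attention), inference is parallel, but you need ground-truth durations from a forced aligner or a teacher model during training.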


The Springer guys claim to have a fast model, but since you know them, you probably know the repo as well :slight_smile:


Definitely don’t forget about the LPCNet vocoder, there’s a paper on how to make it faster!
FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction (https://arxiv.org/pdf/2005.05551.pdf)

@erogol Thanks for all your awesome work!


Multi-band MelGAN already provides very fast inference. I think the bottleneck is the TTS model, which is what we need to update. But these are also good suggestions.


Yes, it seems that FeatherWave is the next best option.

Maybe this is for the text-to-feature part?

SpeedySpeech has an RTF of about 0.2 to 0.25 on my PC (4-core i5) without CUDA, which is impressive, and the generated audio is good in general. If you feed it longer sentences, it gets unstable towards the end and one can hardly understand what is being said. Another disadvantage is the poor performance I observed on ARM.

Thanks for that feedback. When you say CUDA deactivated, do you mean you performed inference on your i5 CPU, or on a GPU?

And yes, as someone said, “year of the vocoder” : https://arxiv.org/abs/2007.15256

Exactly, no GPUs involved.

I don’t know who is willing to invest time and resources into implementations like VocGAN without any demos or pretrained models to test. But you are right, 2020 is turning into a wild zoo. My guess: these papers were written by a highly developed artificial intelligence :slight_smile:

There is a reference in the VocGAN paper; the demo is here: https://nc-ai.github.io/speech/publications/vocgan/