What are the TTS models you know to be faster than Tacotron?

I believe we’ve done almost everything practically possible on Tacotron. Mozilla TTS has the most robust public Tacotron implementation so far. However, it is still slightly slow for low-end devices.

It is time for us to go for a new model. I just want to ask your opinion about what model we should use for this next iteration. You can also share some papers if you like.


I think these two would be worth a look, as their non-autoregressive approach makes them parallelizable:
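To make the non-autoregressive point concrete, here is a toy numpy sketch (purely illustrative shapes and weights, not any real model): an autoregressive decoder must loop because frame t depends on frame t-1, while a parallel decoder can emit all frames in one batched op.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 100, 80  # hypothetical: 100 decoder steps, 80-dim mel frames
W = rng.standard_normal((D, D)) * 0.01

def autoregressive_decode(T):
    """Each frame depends on the previous one -> strictly sequential."""
    frames = [np.zeros(D)]
    for _ in range(T):
        frames.append(np.tanh(W @ frames[-1]))  # step t needs step t-1
    return np.stack(frames[1:])

def parallel_decode(T):
    """Frames computed independently from encoder outputs -> one matmul."""
    encoder_out = rng.standard_normal((T, D))
    return np.tanh(encoder_out @ W.T)  # all T frames at once

print(autoregressive_decode(T).shape)  # (100, 80)
print(parallel_decode(T).shape)        # (100, 80)
```

The sequential loop is what makes Tacotron-style decoding slow on weak hardware; the second function is the shape of the speedup these papers are after.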


I guess the first model needs text-to-voice alignment information to be extracted beforehand. And, as with any Google TTS paper, they do not explain the real core of the model, which is the part that extracts linguistic features from the text. I'd guess it is relatively harder to implement and train for different languages.

But thanks for the second link. I didn't know about that one.


What about FastSpeech 1 & 2 and DurIAN? Though they need duration info for training.
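The "duration info" these models need feeds a length regulator: each phoneme encoding is repeated for as many mel frames as its (predicted or extracted) duration. A minimal sketch of that expansion step, with made-up toy values:

```python
import numpy as np

def length_regulator(phoneme_feats, durations):
    """Repeat each phoneme encoding by its duration in frames."""
    return np.repeat(phoneme_feats, durations, axis=0)

feats = np.arange(6).reshape(3, 2)   # 3 phonemes, 2-dim toy encodings
durs = np.array([2, 1, 3])           # frames per phoneme
out = length_regulator(feats, durs)
print(out.shape)  # (6, 2) -> total frames = sum(durations)
```

Once durations are known, decoding needs no autoregressive loop, which is exactly why these models are fast; the cost is extracting those durations (e.g. from a teacher model or forced alignment) at training time.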


The Springer guys claim to have a fast model, but since you know them, you probably know the repo as well :slight_smile:


Definitely don't forget about the LPCNet vocoder; there's a paper on how to make it faster:
[FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction](https://arxiv.org/pdf/2005.05551.pdf)
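The "LP" in LPCNet is plain linear prediction: a cheap filter estimates each sample from the previous ones, so the network only has to model the small residual. A toy numpy sketch of that prediction step on a sinusoid (hypothetical 2nd-order coefficients, not LPCNet's actual analysis):

```python
import numpy as np

def lpc_predict(signal, coeffs):
    """Estimate s[t] from the previous len(coeffs) samples."""
    order = len(coeffs)
    pred = np.zeros_like(signal)
    for t in range(order, len(signal)):
        # newest sample first: coeffs[0]*s[t-1] + coeffs[1]*s[t-2] + ...
        pred[t] = coeffs @ signal[t - order:t][::-1]
    return pred

t = np.arange(1000) / 1000
s = np.sin(2 * np.pi * 5 * t)
# a sinusoid satisfies s[t] = 2cos(w)s[t-1] - s[t-2] exactly
w = 2 * np.pi * 5 / 1000
pred = lpc_predict(s, np.array([2 * np.cos(w), -1.0]))
residual = s[2:] - pred[2:]
print(np.max(np.abs(residual)))  # near zero: LPC captures most of the signal
```

Because the residual is much simpler than the raw waveform, the neural part can stay tiny, which is where LPCNet's speed comes from.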

@erogol Thanks for all your awesome work!


Multi-band MelGAN already provides very fast inference. I think the bottleneck is the TTS model, which is what we need to update. But these are also good suggestions.
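One way to back up the "TTS model is the bottleneck" claim is to time the two stages separately. A rough sketch with stand-in functions (the shapes and models here are hypothetical placeholders, not Mozilla TTS calls):

```python
import time
import numpy as np

def profile(fn, *args, runs=10):
    """Rough wall-clock average over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs

# stand-ins for the real pipeline stages
def tts_model(text_ids):   # text -> mel spectrogram
    return np.random.standard_normal((len(text_ids) * 5, 80))

def vocoder(mel):          # mel -> waveform
    return np.random.standard_normal(mel.shape[0] * 256)

mel = tts_model(np.arange(50))
print(f"TTS:     {profile(tts_model, np.arange(50)) * 1e3:.3f} ms/run")
print(f"Vocoder: {profile(vocoder, mel) * 1e3:.3f} ms/run")
```

Swapping the real models into the stand-ins would show per-stage latency on the target low-end device and make the comparison concrete.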