Which TTS models do you know of that are faster than Tacotron?

@edresson1 and I made some more changes, and it seems to be working better now. The fork is here: https://github.com/george-roussos/hifi-gan — the changes are in meldataset.py

The AP (audio processing) values are hardcoded (so check them if your TTS has special processing attributes), but I will change that if it ends up working okay :slight_smile: Right now I am training it and it seems to be working well. Will update further.
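For reference, these are the kinds of audio-processing values that get hardcoded in meldataset.py. The numbers below are common defaults for 22.05 kHz models, not necessarily the fork's; the point is that they must match your TTS config, or the vocoder will be trained on mismatched spectrograms:

```python
# Illustrative defaults only, not the fork's actual values.
# Each of these must agree with your TTS model's audio config.
sample_rate = 22050   # must equal the TTS model's sample rate
n_fft = 1024          # FFT size
hop_length = 256      # frame shift in samples
win_length = 1024     # window length in samples
n_mels = 80           # number of mel bands
fmin = 0.0            # lowest mel-filter frequency (Hz)
fmax = 8000.0         # highest mel-filter frequency (Hz)
```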


Update: HiFi-GAN is still training. It sounds much better than all the other GANs; however, breath effects are still metallic. Will see if it improves.

@dkreutz did VocGAN turn out successful?

Yes and no. Ground-truth inference sounds quite good, but I can’t get it running with Mozilla-TTS Taco2 output. I am considering switching to HiFi-GAN too… when real life allows me to…

Thanks @georroussos for keeping us up to date.
Is the metallic effect during voiced speech (not breathing) less noticeable than with PWGAN, in your opinion?

Actually, for me breathing has only sounded okay with PWGAN. Every MelGAN variant (including HiFi-GAN) has this really annoying artifact on breathing. Otherwise, the results I am getting with HiFi-GAN are very good.

Now I am nearly done finetuning with ground truth mels. The metallic breathing did not go away.

Has anybody played around with DC-TTS enough to know how fast it is? I’m not sure I’ve seen it mentioned here yet, and I’ve been curious how fast its inference is. See https://github.com/Kyubyong/dc_tts for an implementation.

DC-TTS has not been updated for a long time, so I doubt you’d get better results than with Mozilla TTS, which is constantly maintained.

Now I am finetuning my TTS with r=1 and running another finetuning session with r=2 and a BN (batch-norm) prenet. With r=1 I am able to get rid of all background noise and get an almost perfectly clear voice (with HiFi-GAN), but the metallic breathing is still there. Also, with r=1, training on my dataset is very fragile.
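For anyone unfamiliar with the r parameter, here is a small conceptual sketch (not repo code) of the trade-off: the decoder emits r mel frames per step, so r=1 gives the finest temporal resolution (often cleaner audio) at the cost of more decoder steps and a harder alignment problem, which matches the fragility above.

```python
# Conceptual sketch of the reduction factor in Tacotron-style decoders.
def decoder_steps(n_mel_frames: int, r: int) -> int:
    """Number of autoregressive decoder steps for a given reduction factor."""
    return (n_mel_frames + r - 1) // r  # ceiling division

print(decoder_steps(800, r=1))  # 800 steps: finer resolution, slower, more fragile
print(decoder_steps(800, r=2))  # 400 steps: roughly twice as fast, easier alignment
```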

You are right regarding DC-TTS, though it’s interesting that it outperformed Tacotron in some respects; compare https://github.com/kyubyong/css10

Are you training on a public dataset (LJSpeech) or your own? What is your overall goal? Any audio samples to check the quality?

Personally, I am overwhelmed by the sheer number of papers and implementations out there.

Tacotron2 can achieve impressive results, and benchmarking on LJSpeech does not really show this. With my dataset, which is far from TTS-oriented but has no background noise and completely matching transcriptions, I am able to synthesise speech of up to 5000 characters with minimal to no errors. My goal here is to make my TTS sound as natural as I can.

The secret to not being overwhelmed is to take it slow and try everything :slight_smile:

Dear All,

I am from Ethiopia, and I am working on my MSc research on TTS for one of the ancient Ethiopian languages: Geez.

I am facing a problem getting a GPU to train the recent TTS models, and I see that TransformerTTS is the fastest TTS. So can I train and use TransformerTTS using only a CPU, so that I can use it for my research work?

Hi All,

I am from Ethiopia, and I am working on my MSc research on TTS for one of the ancient Ethiopian languages: Geez.

I am facing a problem getting a GPU to train the recent TTS models, and I see that VocGAN is the fastest TTS. So can I train and use VocGAN using only a CPU, so that I can use it for my research work?

Hi @gmtsehayneh

Welcome to the forum :slightly_smiling_face:

Your question is probably best directed to the developers of the repo you link to, which, as far as I know, is not associated with the TTS repo here. Likewise for the similar message you posted directly after.

As a more general point regarding your GPU comment: if you don’t have direct access to a GPU, you may want to look into Google Colab. It’s free, but there are some additional challenges you’d need to work around, as they only let the kernel run for 12 hours (so you’d need to save checkpoints before it expires, so you can continue when you restart). Best to Google for details, as it’s somewhat off topic here as well.
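A minimal sketch of the usual workaround, run inside a Colab notebook — the training flags below are illustrative, so check your repo’s CLI:

```python
# Mount Google Drive so checkpoints survive the 12-hour kernel limit.
from google.colab import drive

drive.mount('/content/drive')

# Then point the trainer's output folder at Drive, e.g. (flag names are
# illustrative, not from any specific repo):
# !python train.py --config_path config.json \
#     --output_path /content/drive/MyDrive/tts_checkpoints

# To resume after a restart, pass the last saved checkpoint back in
# (e.g. a --restore_path flag, if your training script supports one).
```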

All the best with the MSc!


Just a note under this thread: I implemented SpeedySpeech, and together with Multi-Band MelGAN they provide the fastest TTS inference to my knowledge.

Once available, I think you can add SpeedySpeech + HiFi-GAN to the list of fast TTS inference options.

Though it refers to TFLite, this is a good comparison: https://github.com/tulasiram58827/TTS_TFLite
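For reference, running one of those TFLite models from Python looks roughly like this. The model file name and tensor details are illustrative, not taken from the linked repo:

```python
# Rough sketch of TFLite inference; real inputs come from a text frontend.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="speedyspeech.tflite")  # illustrative file name
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input with the declared shape/dtype, just to exercise the model.
dummy_ids = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy_ids)
interpreter.invoke()

mel = interpreter.get_tensor(output_details[0]['index'])
print(mel.shape)  # e.g. a mel spectrogram to feed into a vocoder
```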


I didn’t know that repo. Thx for linking it… looks interesting.

Thanks for this helpful information about inference speed.

@erogol and others, we’d like to train (and release) models for many languages that can run at about 100x real time on a fast GPU. We use these models in our software to help language learners (Language Learning with Netflix). After reading this thread, the most promising options seem to be FastSpeech2/GlowTTS/SpeedySpeech + MB-MelGAN/HiFi-GAN. Do you have any more specific advice for us? :slight_smile:

Btw, the SpeedySpeech repo looks promising. @erogol, when you say you implemented it, does that mean it’s somewhere in your Mozilla TTS repo? If not, would you like some help to do so?

Any model would run that fast on a GPU — even the largest ones, except WaveRNN.

Yes, SpeedySpeech is implemented in the TTS repo, independently of the original repo. So you can give it a shot.
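If you want to try it, loading a trained model plus a vocoder looks roughly like this. The argument names have changed between versions, so check TTS/utils/synthesizer.py in your checkout; the paths are illustrative:

```python
# Hedged sketch: synthesis with the repo's Synthesizer helper.
from TTS.utils.synthesizer import Synthesizer

synth = Synthesizer(
    tts_checkpoint="speedyspeech/best_model.pth.tar",   # illustrative paths
    tts_config_path="speedyspeech/config.json",
    vocoder_checkpoint="mb_melgan/best_model.pth.tar",
    vocoder_config="mb_melgan/config.json",
    use_cuda=True,
)

wav = synth.tts("This is a test sentence.")
synth.save_wav(wav, "out.wav")
```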

Thanks, I found it.

Did you pretrain any models for it (SpeedySpeech)?

Hi everyone, me again.

@erogol, you said that any model would run at 100 RTF (100x real time) on a modern GPU. However, after some benchmarks, I can’t get above 3 to 4 RTF with the available pre-trained models of mozilla-TTS.

Hardware: 1× RTX 2080 GPU

Tried configs:

  • Tacotron2 + MB-MelGAN: ~3 RTF
  • GlowTTS + MB-MelGAN: ~3 RTF
  • SpeedySpeech + MB-MelGAN: ~4 RTF

What combination of models did you use to reach 100 RTF?

Thanks

I don’t know how you benchmark, but the RTFs you shared are what I get on a CPU.
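For anyone comparing numbers: this thread uses RTF loosely to mean “times faster than real time”, while papers usually define RTF as synthesis time divided by audio duration (lower is better), so state which convention you use. A minimal measurement sketch — the sample rate is illustrative, and `tts_fn` can be any text-to-wav callable, e.g. the Synthesizer sketch above:

```python
import time

def benchmark_rtf(tts_fn, text, sample_rate=22050, warmup=True):
    """Measure wall-clock RTF for a callable mapping a string to a waveform."""
    if warmup:
        tts_fn(text)  # exclude one-off GPU warm-up / allocation cost
    start = time.perf_counter()
    wav = tts_fn(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(wav) / sample_rate
    rtf = elapsed / audio_seconds        # paper convention: lower is better
    speedup = audio_seconds / elapsed    # "x real time": higher is better
    print(f"RTF {rtf:.3f} ({speedup:.0f}x real time)")
    return rtf
```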