Which TTS models do you know of that are faster than Tacotron?

@edresson1 and I made some more changes, and it seems to be working better now. The fork is here: https://github.com/george-roussos/hifi-gan — the changes are in meldataset.py

The AP (audio processing) values are hardcoded (so check them if your TTS has special processing attributes), but I will change that if it ends up working okay :slight_smile: Right now I am training it and it seems to be working well. Will update further.
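For reference, these are the kinds of audio-processing values that get hardcoded in meldataset.py. The numbers below are common defaults for 22.05 kHz models, not necessarily the fork's; the point is that they must match your TTS config, or the vocoder will be trained on mismatched spectrograms:

```python
# Illustrative defaults only, not the fork's actual values.
# Each of these must agree with your TTS model's audio config.
sample_rate = 22050   # must equal the TTS model's sample rate
n_fft = 1024          # FFT size
hop_length = 256      # frame shift in samples
win_length = 1024     # window length in samples
n_mels = 80           # number of mel bands
fmin = 0.0            # lowest mel-filter frequency (Hz)
fmax = 8000.0         # highest mel-filter frequency (Hz)
```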


Update: HiFi-GAN is still training. It sounds much better than all the other GANs; however, breath effects are still metallic. Will see if it improves.

@dkreutz did VocGAN turn out successful?

Yes and no. Ground-truth inference sounds quite good, but I can’t get it running with Mozilla-TTS Taco2 output. I am considering switching to HiFi-GAN too… when real life allows me to…

Thanks @georroussos for keeping us up to date.
Is the metallic effect during voiced speech (not breathing) less noticeable than with PWGAN, in your opinion?

Actually, for me breathing has only sounded okay with PWGAN. Every MelGAN variant (including HiFi-GAN) has this really annoying artifact on breathing. Otherwise, the results I am getting with HiFi-GAN are very good.

Now I am nearly done finetuning with ground truth mels. The metallic breathing did not go away.

Has anybody played around with DC-TTS enough to know how fast it is? I’m not sure I’ve seen it mentioned here yet, and I’ve been curious how fast its inference is. See https://github.com/Kyubyong/dc_tts for an implementation.

DC-TTS has not been updated for a long time, so I doubt you’d get better results than with Mozilla TTS, which is constantly maintained.

Now I am finetuning my TTS with r=1 and running another finetuning session with r=2 and a BN (batch-norm) prenet. With r=1 I am able to get rid of all background noise and get an almost perfectly clear voice (with HiFi-GAN), but the metallic breathing is still there. Also, with r=1, training on my dataset is very fragile.
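For anyone unfamiliar with the r parameter, here is a small conceptual sketch (not repo code) of the trade-off: the decoder emits r mel frames per step, so r=1 gives the finest temporal resolution (often cleaner audio) at the cost of more decoder steps and a harder alignment problem, which matches the fragility above.

```python
# Conceptual sketch of the reduction factor in Tacotron-style decoders.
def decoder_steps(n_mel_frames: int, r: int) -> int:
    """Number of autoregressive decoder steps for a given reduction factor."""
    return (n_mel_frames + r - 1) // r  # ceiling division

print(decoder_steps(800, r=1))  # 800 steps: finer resolution, slower, more fragile
print(decoder_steps(800, r=2))  # 400 steps: roughly twice as fast, easier alignment
```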

You are right regarding DC-TTS, though it’s interesting that it outperformed Tacotron in some respects; compare https://github.com/kyubyong/css10

Are you training on a public dataset (LJSpeech) or your own? What is your overall goal? Any audio samples to check the quality?

Personally, I am overwhelmed by the sheer number of papers and implementations out there.

Tacotron2 can achieve impressive results, and benchmarking on LJSpeech does not really show this. With my dataset, which is far from TTS-oriented but has no background noise and completely matching transcriptions, I am able to synthesise speech of up to 5000 characters with minimal to no errors. My goal here is to make my TTS sound as natural as I can.

The secret to not being overwhelmed is to take it slow and try everything :slight_smile:

Dear All,

I am from Ethiopia, and I am working on my MSc research on TTS for one of the ancient Ethiopian languages: Geez.

I am facing a problem getting a GPU to train the recent TTS models, and I see that TransformerTTS is the fastest TTS. So can I train and use TransformerTTS using only a CPU, so that I can use it for my research work?

Hi All,

I am from Ethiopia, and I am working on my MSc research on TTS for one of the ancient Ethiopian languages: Geez.

I am facing a problem getting a GPU to train the recent TTS models, and I see that VocGAN is the fastest TTS. So can I train and use VocGAN using only a CPU, so that I can use it for my research work?

Hi @gmtsehayneh

Welcome to the forum :slightly_smiling_face:

Your question is probably best directed to the developers of the repo you link to, which, as far as I know, is not associated with the TTS repo here. Likewise for the similar message you posted directly after.

As a more general point regarding your GPU comment: if you don’t have direct access to a GPU, you may want to look into Google Colab. It’s free, but there are some additional challenges you’d need to work around, as they only let the kernel run for 12 hours (so you’d need to save checkpoints before it expires, so you can continue when you restart). Best to Google for details, as it’s somewhat off topic here as well.
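A minimal sketch of the usual workaround, run inside a Colab notebook — the training flags below are illustrative, so check your repo’s CLI:

```python
# Mount Google Drive so checkpoints survive the 12-hour kernel limit.
from google.colab import drive

drive.mount('/content/drive')

# Then point the trainer's output folder at Drive, e.g. (flag names are
# illustrative, not from any specific repo):
# !python train.py --config_path config.json \
#     --output_path /content/drive/MyDrive/tts_checkpoints

# To resume after a restart, pass the last saved checkpoint back in
# (e.g. a --restore_path flag, if your training script supports one).
```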

All the best with the MSc!


Just a note under this thread: I implemented SpeedySpeech, and together with Multi-Band MelGAN they provide the fastest TTS inference to my knowledge.

Once available, I think you can add SpeedySpeech + HiFi-GAN to the list of fast TTS inference options.

Though it refers to TFLite, this is a good comparison: https://github.com/tulasiram58827/TTS_TFLite
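For reference, running one of those TFLite models from Python looks roughly like this. The model file name and tensor details are illustrative, not taken from the linked repo:

```python
# Rough sketch of TFLite inference; real inputs come from a text frontend.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="speedyspeech.tflite")  # illustrative file name
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input with the declared shape/dtype, just to exercise the model.
dummy_ids = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy_ids)
interpreter.invoke()

mel = interpreter.get_tensor(output_details[0]['index'])
print(mel.shape)  # e.g. a mel spectrogram to feed into a vocoder
```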


I didn’t know that repo. Thx for linking it… looks interesting.

Thanks for this helpful information about inference speed.

@erogol and others, we’d like to train (and release) models for many languages that can run at about 100x real time on a fast GPU. We use these models in our software to help language learners (Language Learning with Netflix). After reading this thread, the most promising options seem to be FastSpeech2/GlowTTS/SpeedySpeech + MB-MelGAN/HiFi-GAN. Do you have any more specific advice for us? :slight_smile:

Btw, the SpeedySpeech repo looks promising. @erogol, when you say you implemented it, does that mean it’s somewhere in your Mozilla TTS repo? If not, would you like some help to do so?

Any model would run that fast on a GPU — even the largest ones, except WaveRNN.

Yes, SpeedySpeech is implemented in the TTS repo, independently of the original repo. So you can give it a shot.
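If you want to try it, loading a trained model plus a vocoder looks roughly like this. The argument names have changed between versions, so check TTS/utils/synthesizer.py in your checkout; the paths are illustrative:

```python
# Hedged sketch: synthesis with the repo's Synthesizer helper.
from TTS.utils.synthesizer import Synthesizer

synth = Synthesizer(
    tts_checkpoint="speedyspeech/best_model.pth.tar",   # illustrative paths
    tts_config_path="speedyspeech/config.json",
    vocoder_checkpoint="mb_melgan/best_model.pth.tar",
    vocoder_config="mb_melgan/config.json",
    use_cuda=True,
)

wav = synth.tts("This is a test sentence.")
synth.save_wav(wav, "out.wav")
```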

Thanks, I found it.

Did you pretrain any models for it (SpeedySpeech)?

Hi everyone, me again.

@erogol, you said that any model would run at 100 RTF (100x real time) on a modern GPU. However, after some benchmarks, I can’t get above 3 to 4 RTF with the available pre-trained models of mozilla-TTS.

Hardware: 1× RTX 2080 GPU

Tried configs:

  • Tacotron2 + MB-MelGAN: ~3 RTF
  • GlowTTS + MB-MelGAN: ~3 RTF
  • SpeedySpeech + MB-MelGAN: ~4 RTF

What combination of models did you use to reach 100 RTF?

Thanks

I don’t know how you benchmark, but the RTFs you shared are what I get on a CPU.
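For anyone comparing numbers: this thread uses RTF loosely to mean “times faster than real time”, while papers usually define RTF as synthesis time divided by audio duration (lower is better), so state which convention you use. A minimal measurement sketch — the sample rate is illustrative, and `tts_fn` can be any text-to-wav callable, e.g. the Synthesizer sketch above:

```python
import time

def benchmark_rtf(tts_fn, text, sample_rate=22050, warmup=True):
    """Measure wall-clock RTF for a callable mapping a string to a waveform."""
    if warmup:
        tts_fn(text)  # exclude one-off GPU warm-up / allocation cost
    start = time.perf_counter()
    wav = tts_fn(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(wav) / sample_rate
    rtf = elapsed / audio_seconds        # paper convention: lower is better
    speedup = audio_seconds / elapsed    # "x real time": higher is better
    print(f"RTF {rtf:.3f} ({speedup:.0f}x real time)")
    return rtf
```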