What are the TTS models you know to be faster than Tacotron?

Oh! GlowTTS and HiFiGAN share some authors, and their GlowTTS repo has been updated accordingly.

I tried to use HiFiGAN, but it looks like the Mozilla TTS spectrogram is not compatible with it as-is and needs changes. I had a quick look but was unsuccessful.

The same is true for VocGAN.

Were you able to fix it?

I am looking at HiFiGAN again and it looks like the clue is in meldataset.py, in the mel_spectrogram function and the way the mel is computed when spectrogram inversion is performed. I synthesized a spectrogram using Mozilla TTS and LJSpeech (an old model with no mean-var normalization) and it still did not work with the LJSpeech HiFiGAN model (the sound is distorted). I grab the spectrogram by adding the following to the tts function in synthesize.py:

import numpy as np
import torch

# transpose the postnet output to (n_mels, T) and add a batch dimension
spectrogram = torch.FloatTensor(postnet_output.T).unsqueeze(0)
np.save("spectrogram.npy", spectrogram.numpy())

and it does produce waveforms using HiFiGAN, just not of good quality. Does anyone have an idea what needs to be changed? It really sounds good in the samples they provide. I will try a training run and then finetune on ground-truth alignments extracted using Mozilla TTS. It might work.
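
For reference, feeding the saved spectrogram.npy to HiFiGAN looks roughly like the sketch below. It is based on the repo's own inference scripts; the checkpoint and config filenames are placeholders for whichever pretrained model you downloaded, and env.AttrDict and models.Generator come from the hifi-gan checkout itself.

import json
import numpy as np
import torch
from env import AttrDict      # from the hifi-gan repo
from models import Generator  # from the hifi-gan repo

# load the config that ships with the checkpoint
with open("config.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load("generator_v1", map_location="cpu")  # placeholder name
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()

with torch.no_grad():
    mel = torch.FloatTensor(np.load("spectrogram.npy"))  # (1, n_mels, T)
    audio = generator(mel).squeeze().numpy()             # float waveform in [-1, 1]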

Thanks for the code example, will try to adapt that to VocGAN.
It is all about “melspec compatibility”: ideally you should use the same melspec generator for Taco2 and the vocoder, with identical settings/configuration, e.g. min/max frequency, normalization, win/hop length, etc. (see the sketch below).
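
To make that concrete, here is a minimal sketch of the parameters that have to agree on both sides. The values are illustrative (roughly LJSpeech-style settings), not taken from any particular config:

import librosa

# the acoustic model and the vocoder must be trained with the SAME numbers
MEL_CFG = dict(
    sr=22050,         # sample rate
    n_fft=1024,       # FFT size
    hop_length=256,   # hop length
    win_length=1024,  # window length
    n_mels=80,        # number of mel bands
    fmin=0,           # min mel frequency
    fmax=8000,        # max mel frequency
)

def melspec(wav):
    m = librosa.feature.melspectrogram(y=wav, **MEL_CFG)
    # the normalization/compression step has to match too (dB vs. log, scaling, clipping)
    return librosa.power_to_db(m)

If any one of these differs between the two models, the vocoder still produces a waveform; it just sounds distorted.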

I use the same in both, but I still haven't figured out the normalization scheme they use. Maybe that is the culprit.

You most likely need to adapt Mozilla TTS's audio configuration and retrain the model for it to work. I think I'll give it a try this coming weekend.

That's great, you know your way around it more than I do :slight_smile: I asked them some things on the repo, maybe they will reply. I think that of all the GANs, HiFiGAN looks to be the most promising one. I ran a session as-is, and after 3 hours the sound was clear.

speedyspeech+hifiganv1
speedyspeech+hifiganv2
speedyspeech+hifiganv3

P.S. @erogol is right when he says that SpeedySpeech does not speak correctly in some cases. For instance, I had to replace “isn't” with “is not” (a trivial pre-processing step like the one below works).
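
A sketch of that kind of text normalization; the contractions beyond “isn't” are just illustrative:

# expand contractions that SpeedySpeech tends to mispronounce
CONTRACTIONS = {"isn't": "is not", "aren't": "are not", "won't": "will not"}

def normalize(text):
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    return text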

These samples sound very good. We really have to get this working with Mozilla TTS.

I made some progress in plugging it in, but it's not working yet. I will keep working on it, but if anyone wants to help, let me know.

Update: I finished training PWGAN on ground-truth alignments, but there is no difference; there is still sound ghosting and a shaky voice.

I was able to implement the Mozilla TTS spectrogram in HiFiGAN, but I don't know if it is correct. I will do a run now and check.

Thank you for your code snippets for extracting the spectrogram; I used them for SpeedySpeech. The GlowTTS+HiFiGAN samples found here sound much better than those I generated. I will re-check this.

Maybe you can upload some samples, or code showing how you utilized Mozilla TTS + HiFiGAN?

@edresson1 and I made some more changes and it seems to be working better now. The fork is here: https://github.com/george-roussos/hifi-gan. The changes are in meldataset.py.

The AP (AudioProcessor) values are hardcoded (so check them if your TTS has special processing attributes), but I will change that if it ends up working okay :slight_smile: Right now I am training it and it seems to be going well. Will update further.
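
The gist of the change in meldataset.py is roughly the sketch below. It assumes Mozilla TTS is installed; the AudioProcessor values shown are placeholders for your model's audio config, and the exact argument names vary between Mozilla TTS versions, so treat it as a sketch rather than the exact diff:

import torch
from TTS.utils.audio import AudioProcessor

# hardcoded AP values -- replace with the audio section of your TTS config
# (or build it from the config directly, e.g. AudioProcessor(**c.audio))
ap = AudioProcessor(
    sample_rate=22050,
    num_mels=80,
    fft_size=1024,
    hop_length=256,
    win_length=1024,
    mel_fmin=0,
    mel_fmax=8000.0,
    signal_norm=True,  # must match the TTS model's normalization
)

def mel_spectrogram(y):
    # y: 1-D float waveform; the repo's original function takes the STFT
    # parameters as arguments, here they all come from the AudioProcessor
    return torch.from_numpy(ap.melspectrogram(y))

This way the vocoder trains on exactly the features the TTS model produces.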


Update: HiFiGAN is still training. It sounds much better than all the other GANs. However, breath effects are still metallic. Will see if that improves.

@dkreutz, did VocGAN turn out to be a success?

Yes and no. Ground-truth inference sounds quite good, but I can't get it running with Mozilla TTS Taco2 output. I am considering switching to HiFiGAN too… when real life allows me to…

Thanks @georroussos for keeping us up to date.
Is the metallic effect during voiced speech (not breathing) less noticeable than with PWGAN, in your opinion?

Actually, for me breathing has only worked okay with PWGAN. Every MelGAN variant (including HiFiGAN) has this really annoying artifact in breathing. Otherwise, the results I am getting with HFG are very good.

Now I am nearly done finetuning with ground truth mels. The metallic breathing did not go away.

Has anybody played around with DC-TTS enough to know how fast it is? I'm not sure I've seen it mentioned here yet, and I've been curious how fast its inference is. See https://github.com/Kyubyong/dc_tts for an implementation.

DC-TTS has not been updated for a long time, so I doubt you'd get better results than with Mozilla TTS, which is constantly maintained.

Now I am finetuning my TTS with r=1 (the reduction factor) and running another finetuning session with r=2 and a BN (batch-norm) prenet. With r=1 I am able to get rid of all background noise and get an almost perfectly clear voice (with HiFiGAN), but the metallic breathing is still there. Also, with r=1, training on my dataset is very fragile.