What are the TTS models you know to be faster than Tacotron?

Oh! GlowTTS and HiFiGAN share some authors, and their GlowTTS repo has been updated accordingly.

I tried to use HiFiGAN, but it looks like the Mozilla TTS spectrogram is not compatible with it as-is and needs changes. I had a quick look but was unsuccessful.

The same is true for VocGAN.

Were you able to fix it?

I am looking at HiFiGAN again and it looks like the clue is in meldataset.py, in the mel_spectrogram function and the way the mel is computed when spectrogram inversion is performed. I synthesized a spectrogram using Mozilla TTS and LJSpeech (an old model with no mean-var normalization) and it still did not work with the LJSpeech HiFiGAN model (the sound is distorted). I grab the spectrogram by adding the following to the tts function in synthesize.py:

import numpy as np
import torch

# transpose the postnet output to (n_mels, T) and add a batch dimension
spectrogram = torch.FloatTensor(postnet_output.T).unsqueeze(0)
np.save("spectrogram.npy", spectrogram.numpy())

and it does produce waveforms using HiFiGAN, just not of good quality. Does anyone have an idea what needs to be changed? It really sounds good in the samples they provide. I will try a training run and then finetune on ground-truth alignments extracted using Mozilla TTS. It might work.
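
For reference, feeding the saved spectrogram.npy to HiFiGAN looks roughly like the sketch below. It is based on the repo's own inference scripts; the checkpoint and config filenames are placeholders for whichever pretrained model you downloaded, and env.AttrDict and models.Generator come from the hifi-gan checkout itself.

import json
import numpy as np
import torch
from env import AttrDict      # from the hifi-gan repo
from models import Generator  # from the hifi-gan repo

# load the config that ships with the checkpoint
with open("config.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load("generator_v1", map_location="cpu")  # placeholder name
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()

with torch.no_grad():
    mel = torch.FloatTensor(np.load("spectrogram.npy"))  # (1, n_mels, T)
    audio = generator(mel).squeeze().numpy()             # float waveform in [-1, 1]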

Thanks for the code example, will try to adapt that to VocGAN.
It is all about “melspec compatibility”: ideally you should use the same melspec generator for Taco2 and the vocoder, with identical settings/configuration, e.g. min/max frequency, normalization, win/hop length, etc. (see the sketch below).
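
To make that concrete, here is a minimal sketch of the parameters that have to agree on both sides. The values are illustrative (roughly LJSpeech-style settings), not taken from any particular config:

import librosa

# the acoustic model and the vocoder must be trained with the SAME numbers
MEL_CFG = dict(
    sr=22050,         # sample rate
    n_fft=1024,       # FFT size
    hop_length=256,   # hop length
    win_length=1024,  # window length
    n_mels=80,        # number of mel bands
    fmin=0,           # min mel frequency
    fmax=8000,        # max mel frequency
)

def melspec(wav):
    m = librosa.feature.melspectrogram(y=wav, **MEL_CFG)
    # the normalization/compression step has to match too (dB vs. log, scaling, clipping)
    return librosa.power_to_db(m)

If any one of these differs between the two models, the vocoder still produces a waveform; it just sounds distorted.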

I use the same in both, but I still haven't figured out the normalization scheme they use. Maybe that is the culprit.

You most likely need to adapt Mozilla TTS's audio configuration and retrain the model for it to work. I think I'll give it a try this coming weekend.

That's great, you know your way around it more than I do :slight_smile: I asked them some things on the repo, maybe they will reply. I think that of all the GANs, HiFiGAN looks to be the most promising one. I ran a session as-is, and after 3 hours the sound was clear.

speedyspeech+hifiganv1
speedyspeech+hifiganv2
speedyspeech+hifiganv3

P.S. @erogol is right when he says that SpeedySpeech does not speak correctly in some cases. For instance, I had to replace “isn't” with “is not” (a trivial pre-processing step like the one below works).
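
A sketch of that kind of text normalization; the contractions beyond “isn't” are just illustrative:

# expand contractions that SpeedySpeech tends to mispronounce
CONTRACTIONS = {"isn't": "is not", "aren't": "are not", "won't": "will not"}

def normalize(text):
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    return text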

These samples sound very good. We really have to get this working with Mozilla TTS.

I made some progress in plugging it in, but it's not working yet. I will keep working on it, but if anyone wants to help, let me know.

Update: I finished training PWGAN on ground-truth alignments, but there is no difference; there is still sound ghosting and a shaky voice.

I was able to implement the Mozilla TTS spectrogram in HiFiGAN, but I don't know if it is correct. I will do a run now and check.

Thank you for your code snippets for extracting the spectrogram; I used them for SpeedySpeech. The GlowTTS+HiFiGAN samples found here sound much better than those I generated. I will re-check this.

Maybe you can upload some samples, or code showing how you utilized Mozilla TTS + HiFiGAN?

@edresson1 and I made some more changes and it seems to be working better now. The fork is here: https://github.com/george-roussos/hifi-gan. The changes are in meldataset.py.

The AP (AudioProcessor) values are hardcoded (so check them if your TTS has special processing attributes), but I will change that if it ends up working okay :slight_smile: Right now I am training it and it seems to be going well. Will update further.
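
The gist of the change in meldataset.py is roughly the sketch below. It assumes Mozilla TTS is installed; the AudioProcessor values shown are placeholders for your model's audio config, and the exact argument names vary between Mozilla TTS versions, so treat it as a sketch rather than the exact diff:

import torch
from TTS.utils.audio import AudioProcessor

# hardcoded AP values -- replace with the audio section of your TTS config
# (or build it from the config directly, e.g. AudioProcessor(**c.audio))
ap = AudioProcessor(
    sample_rate=22050,
    num_mels=80,
    fft_size=1024,
    hop_length=256,
    win_length=1024,
    mel_fmin=0,
    mel_fmax=8000.0,
    signal_norm=True,  # must match the TTS model's normalization
)

def mel_spectrogram(y):
    # y: 1-D float waveform; the repo's original function takes the STFT
    # parameters as arguments, here they all come from the AudioProcessor
    return torch.from_numpy(ap.melspectrogram(y))

This way the vocoder trains on exactly the features the TTS model produces.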


Update: HiFiGAN is still training. It sounds much better than all the other GANs. However, breath effects are still metallic. Will see if that improves.

@dkreutz, did VocGAN turn out to be a success?

Yes and no. Ground-truth inference sounds quite good, but I can't get it running with Mozilla TTS Taco2 output. I am considering switching to HiFiGAN too… when real life allows me to…

Thanks @georroussos for keeping us up to date.
Is the metallic effect during voiced speech (not breathing) less noticeable than with PWGAN, in your opinion?

Actually, for me breathing has only worked okay with PWGAN. Every MelGAN variant (including HiFiGAN) has this really annoying artifact in breathing. Otherwise, the results I am getting with HFG are very good.

Now I am nearly done finetuning with ground truth mels. The metallic breathing did not go away.

Has anybody played around with DC-TTS enough to know how fast it is? I'm not sure I've seen it mentioned here yet, and I've been curious how fast its inference is. See https://github.com/Kyubyong/dc_tts for an implementation.

DC-TTS has not been updated for a long time, so I doubt you'd get better results than with Mozilla TTS, which is constantly maintained.

Now I am finetuning my TTS with r=1 (the reduction factor) and running another finetuning session with r=2 and a BN (batch-norm) prenet. With r=1 I am able to get rid of all background noise and get an almost perfectly clear voice (with HiFiGAN), but the metallic breathing is still there. Also, with r=1, training on my dataset is very fragile.