How to start with TTS + WaveRNN?

Jalau · May 20, 2020, 11:57pm

I am new to the world of deep learning and all that stuff so forgive me for not knowing anything about it. But I am happy to learn.
I came across the Tacotron2-iter-260K model, whose SoundCloud sample sounds awesome. However, once I finally got it deployed after a lot of troubleshooting, the result was not what I expected: it sounded much worse. After digging deeper I noticed that the sample was probably generated in combination with WaveRNN. But how do I continue from here? I set up WaveRNN with the pretrained models from here:

How do I use it in combination with the tacotron2 model? How do I get the expected results as in the soundcloud links for the tacotron2 model?
And is there any way to speed up WaveRNN? I have a 2080 Super and a 3900X, but both are chilling at 10% usage and WaveRNN takes about 2 minutes for one short sentence.
I intend to use this for a real-time text-to-speech application, so I would be happy if there was a way to get synthesis down to around 5 seconds max. Thanks in advance for any help.

Best regards
Jalau

nmstoker · May 21, 2020, 1:29am

Hi @Jalau - welcome to the forum! Not sure which resources you’ve looked at but I’d suggest the models page (and ideally reading around the various bits of the wiki to pick up what you can)

The output will be improved when using a vocoder (i.e. WaveRNN, but also PWGAN or MelGAN, the last two being faster for inference).

You’ve got a few options, including installing the self contained package at the bottom of the model page, or installing the particular models based on the relevant commits.

Another thing that might be handy if you’re starting out and want to see how things fit together is to look at some of the Colabs that people have linked to. There’s a couple of fairly recent ones on this page here: https://github.com/mozilla/TTS/issues/345

Jalau · May 21, 2020, 11:39am

Thanks for the quick answer @nmstoker! I haven’t fully read through the wiki yet so I will do that. However I already took a look at the model section.
I have already installed this model: https://github.com/mozilla/TTS/tree/Tacotron2-iter-260K-824c091
It works, but the result is obviously not the same as the example SoundCloud link on the models page. I have also tried the prepackaged pip install for a different model and got it running too, but it sounded even worse than the Tacotron2 260K model.
So after a lot of troubleshooting I was able to start the demo server with the corresponding models; however, I don’t know how a vocoder can be added on top of that to improve the quality.
Basically, how do I combine a vocoder with an existing TTS demo server to get an improved output from both?
Thanks in advance!

nmstoker · May 21, 2020, 3:25pm

High level: you can simply install the repo for the relevant vocoder and then adjust the configuration for the TTS demo server.

To actually do that, I was suggesting having a look at the Colabs as they literally do all the steps needed to get them to work together, so you’d see how they fit and could then mimic that locally.
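
A rough sketch of the "adjust the configuration" step, assuming the demo server reads a JSON config that points at the TTS and vocoder checkpoints. The key names and paths below are hypothetical and are not taken from the thread or from a specific TTS commit; check the config file shipped with the version you installed for the real ones.

```python
import json
from pathlib import Path


def add_vocoder_settings(server_conf: dict, vocoder_conf: dict) -> dict:
    """Return a copy of the server config with the vocoder entries merged in."""
    merged = dict(server_conf)
    merged.update(vocoder_conf)
    return merged


def missing_paths(conf: dict, path_keys) -> list:
    """List the keys whose configured file or directory does not exist."""
    return [k for k in path_keys if conf.get(k) and not Path(conf[k]).exists()]


# Hypothetical key names and paths, for illustration only.
server_conf = {
    "tts_checkpoint": "models/tts/best_model.pth.tar",
    "tts_config": "models/tts/config.json",
    "use_cuda": True,
}
vocoder_conf = {
    "vocoder_checkpoint": "models/vocoder/checkpoint.pth.tar",
    "vocoder_config": "models/vocoder/config.json",
}

conf = add_vocoder_settings(server_conf, vocoder_conf)
print(json.dumps(conf, indent=2))

# Catch wrong paths before starting the server rather than after.
path_keys = ["tts_checkpoint", "tts_config", "vocoder_checkpoint", "vocoder_config"]
print("Missing files:", missing_paths(conf, path_keys))
```

Checking the paths up front is only a convenience; the Colabs remain the reference for which files and settings actually belong together.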

Jalau · May 21, 2020, 4:09pm

@nmstoker Thanks! I will definitely take a look at those Colab notebooks. Is there any way to get those good-sounding results in a matter of seconds, or will it always take about a minute to render one sentence?

nmstoker · May 21, 2020, 4:55pm

I don’t have PC / GPU figures to hand. I’ve mostly used MelGAN and it’s reasonably fast (i.e. on the order of seconds). I think PWGAN is quick too, and WaveRNN is slower but generally gives the highest quality.
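
One way to compare vocoders on your own hardware is the real-time factor (synthesis time divided by the duration of the audio produced). The sketch below is a minimal illustration: `fake_synthesize` is a hypothetical stand-in for whatever TTS + vocoder call you use, and the 3-second clip length is an assumption, not a figure from this thread.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_synthesize(text: str, sample_rate: int = 22050) -> list:
    """Hypothetical stand-in for a real synthesis call: one second of silence."""
    return [0.0] * sample_rate


samples, elapsed = timed(fake_synthesize, "Hello world")
audio_seconds = len(samples) / 22050
print(f"RTF of the stand-in: {real_time_factor(elapsed, audio_seconds):.4f}")

# About 2 minutes for a sentence assumed to be 3 seconds long:
print(real_time_factor(120.0, 3.0))  # 40.0
```

Timing the same sentence with each vocoder gives directly comparable numbers, which is more useful than GPU utilization alone.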

Jerson_Luiz_de_Paula_Junior · May 21, 2020, 8:40pm

Hello, I am also learning, and I use the example below, which runs in Google Colab with a GPU:

https://colab.research.google.com/github/tugstugi/dl-colab-notebooks/blob/master/notebooks/Mozilla_TTS_WaveRNN.ipynb#scrollTo=klsVLR6w_u4P

othiele · May 22, 2020, 5:51pm

Thanks for the Colab. I wanted to try a different vocoder (the universal WaveRNN from @erogol), but it is made for 16 kHz, whereas my Tacotron was trained at 22.05 kHz.

I am no signals expert. Any idea how to downsample between the two, or is there something else I’m missing here? Because right now I have Donald Duck talking.

georroussos · May 22, 2020, 6:06pm

You have two options. Either finetune WaveRNN with 22050 Hz features (I have tried it and there is a lot of noise), or finetune Taco2 with 16 kHz audio, restoring from your existing TTS model.
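
For the second option the training audio has to be resampled from 22050 Hz to 16000 Hz first. A small sketch of the arithmetic involved: the up/down factors are what a rational (polyphase) resampler such as `scipy.signal.resample_poly` expects, and the pitch factor shows what happens when samples produced at one rate are played back at another. Treating the "Donald Duck" effect as exactly this ratio is an assumption; a feature mismatch between the TTS model and the vocoder can distort the audio in other ways too.

```python
from fractions import Fraction


def resample_ratio(src_rate: int, dst_rate: int):
    """Reduced (up, down) integer factors for resampling src_rate -> dst_rate."""
    ratio = Fraction(dst_rate, src_rate)
    return ratio.numerator, ratio.denominator


def playback_pitch_factor(true_rate: int, assumed_rate: int) -> float:
    """Pitch and speed multiplier when audio at true_rate is played at assumed_rate."""
    return assumed_rate / true_rate


up, down = resample_ratio(22050, 16000)
print(up, down)  # 320 441

# 16 kHz samples played back as if they were 22.05 kHz:
print(playback_pitch_factor(16000, 22050))  # 1.378125
```

A factor of about 1.38 means everything is both faster and higher-pitched, which is in the right direction for a cartoonish voice.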

othiele · May 22, 2020, 8:47pm

Thanks, I thought there might be an easy way … how foolish of me

georroussos · May 22, 2020, 8:54pm

I find vocoders in general to be tricky. I spent a lot of time trying to train WaveRNN but never really got it to work, so now I am trying to make PWGAN work. The original PWGAN repo has some super nice universal vocoders which you can try, if you trained your TTS model with mean-var normalization.
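
For anyone unfamiliar with the term, mean-var normalization means shifting and scaling each feature bin so that it has zero mean and unit variance over the training data. The sketch below shows the general idea on toy numbers; it is not the PWGAN repo's actual code, and how that repo computes or stores its statistics is not covered here. The point is that the TTS model and the vocoder have to agree on the same statistics.

```python
from statistics import mean, pstdev


def fit_mean_var(frames):
    """Per-bin mean and standard deviation over a list of feature frames."""
    bins = list(zip(*frames))  # transpose: one tuple per feature bin
    means = [mean(b) for b in bins]
    stds = [pstdev(b) or 1.0 for b in bins]  # avoid dividing by zero on constant bins
    return means, stds


def normalize(frames, means, stds):
    """Shift and scale every bin to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(f, means, stds)] for f in frames]


def denormalize(frames, means, stds):
    """Invert normalize() using the same statistics."""
    return [[v * s + m for v, m, s in zip(f, means, stds)] for f in frames]


# Toy "spectrogram": 3 frames with 2 bins each (real mels have many more bins).
frames = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
means, stds = fit_mean_var(frames)
normed = normalize(frames, means, stds)
print(means)
print(normed)
```

If the vocoder was trained on normalized features and the TTS model outputs unnormalized ones (or uses different statistics), the vocoder sees inputs outside the range it learned, which usually shows up as noise.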