I find that PWGAN works well; however, some hissing creeps into the audio when it is trained for too long (>700k steps). I believe some other people have noticed this too.
It also sounds a bit metallic and needs to be fitted to the individual speaker for best results.
I might try WaveGrad some time…
If I find the time to integrate Attentron into Mozilla TTS, I’ll let you know. Thanks.
Thanks for the idea of using SST to create a dataset. I’m already doing this; however, it discards a significant portion of the data, which is not ideal for speakers for whom recordings are scarce.
I’m also interested in few- and zero-shot TTS. Attentron is supposed to provide that, but for me it’s not working. Find attached a recording where I pre-trained my system on VCTK and then ran it zero-shot (i.e. no gradient updates) on Joe Biden. While it has some vague similarity, the model can’t even speak straight.
Has anybody had any real success with Google’s more advanced, auto-encoder-based models, such as this?