I want to finetune a model on one speaker and then use it to synthesize speech

It’s easy to get data for that speaker (David Attenborough), though background noise might be a problem.

I’m not sure whether it would be better to train from scratch or to fine-tune.

And is Mozilla TTS well suited for what I want to do? Or would you suggest something else?

I know ASR quite well but am new to TTS.

Also the link in the README is broken:

> If you are new, you can also find here a brief post about TTS architectures and their comparisons.

Since your target is English, try fine-tuning from the LJSpeech model. It will most probably work well. You can also try freezing certain layers, such as the postnet or the stopnet, to speed up convergence.
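In case it helps, here is a minimal PyTorch sketch of that freezing idea. It assumes a Tacotron-style model whose submodules are named `postnet` and `stopnet` (the actual attribute names vary between Mozilla TTS versions, so check your checkpoint); the `Toy` model below is just a stand-in to show usage, not the real architecture.

```python
import torch.nn as nn

def freeze_modules(model: nn.Module, prefixes=("postnet", "stopnet")) -> int:
    """Disable gradients for parameters under the given submodule names.

    Returns the number of parameter tensors frozen.
    """
    frozen = 0
    for name, param in model.named_parameters():
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen += 1
    return frozen

# Toy stand-in for a Tacotron-style model, just to demonstrate the call.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)   # stays trainable
        self.postnet = nn.Linear(4, 4)   # frozen
        self.stopnet = nn.Linear(4, 1)   # frozen

model = Toy()
n_frozen = freeze_modules(model)

# Pass only the still-trainable parameters to the optimizer:
trainable = [p for p in model.parameters() if p.requires_grad]
```

When fine-tuning, remember to build the optimizer after freezing, or filter on `requires_grad` as above, so the frozen layers are truly excluded from updates.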
