Data Requirements for Fine Tuning LJ Speech to learn my voice in English

Hassan_Jalil · August 28, 2020, 8:05pm

I have been going through the issues on GitHub and the questions here on Discourse, and my understanding is that if I want to train Mozilla TTS on my own voice in English, the best approach is to fine-tune the pretrained model with a new dataset.
I have a few questions about this:

How much data is needed for fine-tuning, considering the new data is also in English but has a slightly different accent (I am from Pakistan) and the voice is male? Is 4-5 hours of clean, good-quality data enough?
So the process is: clean the dataset, give it the same structure as LJ Speech, update the config, and start training?
Can someone provide a basic how-to on getting started with fine-tuning a pretrained model on my own dataset?
For the dataset, we only need audio and transcripts, right? We don't need alignments?

Thank you. I know these are noobish questions, but I am just starting out and I couldn't find answers to them.

georroussos · September 1, 2020, 12:08pm

Hi! Yes, this sounds about right. 4-5 hours should be enough if the data is clean, and you should have good results after 30k-40k additional steps.
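
As a rough illustration of the "same structure as LJ Speech" step discussed above, here is a minimal sketch, not an official Mozilla TTS tool. LJ Speech ships a `wavs/` folder plus a pipe-delimited `metadata.csv` with one `id|transcription|normalized transcription` row per clip. The function names `write_ljspeech_metadata` and `total_hours` and the clip IDs are made up for this example, and it assumes plain PCM WAV files. Which text column the loader reads may depend on your TTS version, so the sketch writes the same text to both.

```python
import wave
from pathlib import Path


def write_ljspeech_metadata(dataset_dir, transcripts):
    """Write metadata.csv in the LJ Speech layout: id|text|normalized text.

    `transcripts` maps a clip ID (the WAV file name without ".wav")
    to its transcript. Hypothetical helper, for illustration only.
    """
    metadata_path = Path(dataset_dir) / "metadata.csv"
    with open(metadata_path, "w", encoding="utf-8", newline="") as f:
        for clip_id, text in sorted(transcripts.items()):
            text = text.strip()
            # "|" is the column separator, so it must not appear in the text.
            if "|" in text or "\n" in text:
                raise ValueError(f"Transcript for {clip_id} contains '|' or a newline")
            # Same text in both columns: no separate normalization is done here.
            f.write(f"{clip_id}|{text}|{text}\n")
    return metadata_path


def total_hours(wav_dir):
    """Sum the duration of all PCM WAV files in `wav_dir`, in hours."""
    total_seconds = 0.0
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as w:
            total_seconds += w.getnframes() / w.getframerate()
    return total_seconds / 3600.0
```

Calling `write_ljspeech_metadata("my_voice", {"clip_0001": "Hello world."})` would write `my_voice/metadata.csv` next to a `my_voice/wavs/` folder, and `total_hours("my_voice/wavs")` gives a quick check against the 4-5 hour target mentioned above.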