Data requirements for fine-tuning the LJ Speech model to learn my voice in English

I have been going through the issues on GitHub and the questions here on Discourse, and my understanding is that if I want to train Mozilla TTS on my own voice in English, the best approach is to fine-tune the pretrained model on a new dataset.
I have a few questions about this:

  1. How much data is needed for fine-tuning, considering the new data is also in English but has a slightly different accent (I am from Pakistan) and the voice is male? Is 4-5 hours of clean, good-quality data enough?

  2. Is the process to clean the dataset, give it the same structure as LJ Speech, update the config, and start training?
    Can someone provide a basic how-to for getting started with fine-tuning a pretrained model on my own dataset?

  3. For the dataset, we only need audio and transcripts, right? We don't need alignments?

Thank you. I know these are noobish questions, but I am just starting out and I couldn't find answers to them.

Hi! Yes, this sounds about right. 4-5 hours should be enough if the data is clean, and you should have good results after 30k-40k additional steps.
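
In case it helps with question 2, here is a minimal sketch of the LJ Speech layout: a `wavs/` folder of clips plus a pipe-separated `metadata.csv` with one `id|transcript|normalized transcript` row per clip. The function names and clip ids are hypothetical, and the sketch assumes you have no separate normalized text, so it writes the same transcript in both columns. It also adds up the clip durations so you can check whether you really have 4-5 hours. It does not touch the TTS config; which column the loader reads may depend on your TTS version.

```python
import wave
from pathlib import Path


def ljspeech_metadata_lines(transcripts):
    """Build LJ Speech style rows: id|transcript|normalized transcript.

    `transcripts` maps a clip id (the wav file name without ".wav") to its text.
    Assumption: no separate normalized text exists, so the transcript is repeated.
    """
    lines = []
    for clip_id, text in sorted(transcripts.items()):
        text = " ".join(text.split())  # collapse newlines and repeated spaces
        if "|" in text:
            # "|" is the column separator, so it must not appear in the text
            raise ValueError(f"transcript for {clip_id} contains '|'")
        lines.append(f"{clip_id}|{text}|{text}")
    return lines


def write_metadata(transcripts, dataset_dir):
    """Write <dataset_dir>/metadata.csv and make sure <dataset_dir>/wavs exists."""
    dataset_dir = Path(dataset_dir)
    (dataset_dir / "wavs").mkdir(parents=True, exist_ok=True)
    path = dataset_dir / "metadata.csv"
    rows = ljspeech_metadata_lines(transcripts)
    path.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return path


def total_hours(wav_dir):
    """Sum the duration of every PCM .wav file in `wav_dir`, in hours."""
    seconds = 0.0
    for wav_path in Path(wav_dir).glob("*.wav"):
        with wave.open(str(wav_path), "rb") as w:
            seconds += w.getnframes() / w.getframerate()
    return seconds / 3600.0
```

Something like `write_metadata({"clip_0001": "Hello there."}, "my_dataset")` followed by `total_hours("my_dataset/wavs")` would be the typical use, with your own clip ids and folder name.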