I have been going through the issues on GitHub and the questions here on Discourse, and my understanding is that if I want to train Mozilla TTS on my own voice in English, the best approach is to fine-tune the pretrained model with a new dataset.
I have a few questions about this:
- How much data is needed for fine-tuning, given that the new data is also in English but has a slightly different accent (I am from Pakistan) and the voice is male? Is 4-5 hours of clean, good-quality data enough?
- Is the process to clean the dataset, give it a structure similar to LJ Speech, update the config, and start training? Can someone provide a basic how-to for getting started with fine-tuning a pretrained model on my own dataset?
- For the dataset, we only need audio and transcripts, right? We don't need alignments?
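
To make the dataset question concrete, here is a minimal sketch of how I would lay out my recordings in an LJ Speech-like structure. It assumes the layout is a `wavs/` folder plus a pipe-delimited `metadata.csv` with `id|transcript|normalized transcript` per line, which is my reading of the LJ Speech format. The function name `write_ljspeech_metadata` and the `transcripts` dictionary are my own placeholders, not part of Mozilla TTS, so please correct me if the loader expects something different.

```python
from pathlib import Path


def write_ljspeech_metadata(dataset_dir, transcripts):
    """Write an LJ Speech-style metadata.csv for the clips in dataset_dir/wavs.

    `transcripts` (hypothetical input) maps a clip id, i.e. the wav file name
    without its extension, to the spoken text. Returns the number of rows written.
    """
    dataset_dir = Path(dataset_dir)
    rows = []
    for wav_path in sorted((dataset_dir / "wavs").glob("*.wav")):
        clip_id = wav_path.stem
        text = transcripts.get(clip_id)
        if text is None:
            # Skip clips that have no transcript instead of writing empty rows.
            continue
        # Collapse whitespace so each entry stays on a single line.
        text = " ".join(text.split())
        if "|" in text:
            raise ValueError(f"transcript for {clip_id} contains the '|' delimiter")
        # Assumption: no separate text normalization, so both text columns match.
        rows.append((clip_id, text, text))

    with open(dataset_dir / "metadata.csv", "w", encoding="utf-8", newline="") as f:
        for row in rows:
            f.write("|".join(row) + "\n")
    return len(rows)
```

Calling `write_ljspeech_metadata("my_voice", {"clip_0001": "Hello world."})` would then produce `my_voice/metadata.csv` next to `my_voice/wavs/`. There is no alignment information anywhere in this layout, which is why I am asking whether audio plus transcripts is really all that is required.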
Thank you. I know these are noobish questions, but I am just starting out and I couldn't find answers to them.