I agree with Baconator: see what works best. Varying the model run settings to get a feel for how they impact results is important, and you’re unlikely to hit the jackpot the first time.
Your amount of audio data seems sensible.
I would give some thought to the content / material you have them read. A couple of points to watch for:
- try to get it quite diverse, covering the styles of speech it could be used for
- include questions, statements; formal and casual text
- balance the length (look at the Notebooks in TTS as they help with distribution of lengths)
- aim for a wide vocab if you can
- the key part with vocab is giving the model examples of all the distinct sounds and, as far as possible, all combinations of sounds (if it never heard e.g. “D” in training, it won’t know what to do at inference time, when it’s creating output)
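As a rough way to sanity-check a script against the points above, here is a minimal sketch that summarises the length spread and letter coverage of a list of transcripts. It is not the TTS notebooks, just plain Python; the function name `coverage_report` is made up for illustration, and letters are only a crude stand-in for actual sounds (a phoneme-level check would be more accurate).

```python
from collections import Counter


def coverage_report(transcripts, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Summarise length spread and letter coverage for a list of transcripts.

    Assumption: letters approximate sounds. For a real check you would
    want to convert the text to phonemes first.
    """
    if not transcripts:
        raise ValueError("need at least one transcript")

    lengths = sorted(len(t) for t in transcripts)
    chars = Counter(c for t in transcripts for c in t.lower() if c in alphabet)

    # Adjacent letter pairs inside each word, as a rough proxy for
    # "combinations of sounds".
    pairs = Counter()
    for t in transcripts:
        for word in t.lower().split():
            letters = [c for c in word if c in alphabet]
            pairs.update(a + b for a, b in zip(letters, letters[1:]))

    return {
        "count": len(lengths),
        "min_len": lengths[0],
        "median_len": lengths[len(lengths) // 2],
        "max_len": lengths[-1],
        "missing_chars": sorted(set(alphabet) - set(chars)),
        "distinct_pairs": len(pairs),
        "questions": sum(t.strip().endswith("?") for t in transcripts),
    }


if __name__ == "__main__":
    print(coverage_report(["Is it a dog?", "The cat sat."]))
```

If `missing_chars` is non-empty or nearly everything is the same length, that is a hint to add more varied material before recording.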
For your readers, it’s ideal to have them aim for a middle-of-the-road speaking style, delivered as consistently as possible:
- it needn’t be a monotone lacking all emotion, but you want a fairly consistent style for similar speech, otherwise the model can get confused
- likewise with volume and pace, aim for consistent delivery
- I gather with voice pros it’s worth having a couple of their good samples ready to play at the start of each session, so they can pick that style up again easily before they start ploughing through hours of recordings
- stress the importance of matching the transcript exactly (if it says “he is” they need to avoid the temptation to say “he’s”, or the model will struggle with the mismatch); if it happens by mistake and is found afterwards, you can update the transcript to match the recording, but that’s more work than avoiding it in the first place
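If you end up with a second text for each clip (say, notes from a listen-through, or the output of a speech recogniser, which is an assumption on my part and not something you said you had), a small word-level diff can flag the “he is” vs “he’s” kind of mismatch. A minimal sketch using only the standard library; `word_mismatches` is a hypothetical helper name:

```python
import difflib
import re


def word_mismatches(transcript, spoken):
    """Return (transcript_words, spoken_words) pairs where the two texts differ.

    Case and punctuation are ignored; curly apostrophes are normalised so
    that "he’s" and "he's" compare equal.
    """
    def tokens(text):
        return re.findall(r"[a-z']+", text.lower().replace("’", "'"))

    a, b = tokens(transcript), tokens(spoken)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    print(word_mismatches("He is late.", "he's late"))
```

Anything it reports is a clip to either re-record or re-transcribe.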
I think you’re right to capture high quality wave files. However, most people I’ve seen use 22 or 24 kHz audio for training, so you may want to try downsampling for a run to see how it works (it should be quicker to train).
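To make the downsampling step a bit more concrete, here is a small sketch that reads a WAV file’s sample rate and works out the integer up/down factors needed to reach a target rate. The 48 kHz source rate and the 22050 Hz target in the demo are my assumptions (I don’t know what you recorded at), and the helper names are made up. The factors are the kind of thing you would pass to a polyphase resampler such as `scipy.signal.resample_poly(x, up, down)`; the actual resampling is left out so the snippet needs nothing beyond the standard library.

```python
from fractions import Fraction
import wave


def wav_sample_rate(path):
    """Read the sample rate (in Hz) from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def resample_factors(src_rate, dst_rate):
    """Return (up, down) such that src_rate * up / down == dst_rate."""
    ratio = Fraction(dst_rate, src_rate)
    return ratio.numerator, ratio.denominator


if __name__ == "__main__":
    # Assumed rates, for illustration only.
    up, down = resample_factors(48000, 22050)
    print(f"48000 Hz to 22050 Hz: up={up}, down={down}")
```

Keep the original high-rate files untouched and write the downsampled copies to a separate folder, so you can always try another rate later.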
It will probably help too if they can record in a room without too much reverb.
Other than that, I’d have a browse over posts here to see what other points have come up for scenarios that sound like what you’re trying to do.
Hope it goes well! I’d be keen to follow how it works out.