This document from Microsoft gives some general points on the challenges of preparing samples for a custom voice from specifically recorded voice samples. It’s clearly written with their voice services in mind, but the principles apply fairly well to any TTS approach:
In cases where you’ve at least got clear audio and some form of transcript, it may be possible to use what’s called “forced alignment” - this takes the transcript and tries to figure out which sections of the audio correspond best to the text. There are several tools that do it; one I’ve tried with good results is Aeneas.
But the challenges @dkreutz mentions still remain - even if a forced aligner has helped align the audio and text, you still need to verify the result and that’s the tough part!
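To cut down how much you have to listen to, one option is a rough sanity check on the aligner output before the manual pass. This is only a sketch: it assumes a JSON sync map with a `fragments` list whose entries carry `id`, `begin`, `end` and `lines` (which is how I understand Aeneas's JSON output to look), and the function name and the characters-per-second thresholds are my own guesses that you would need to tune for your speaker.

```python
def flag_suspect_fragments(sync_map, min_cps=5.0, max_cps=25.0):
    """Return (id, reason) pairs for fragments with an implausible speaking rate.

    sync_map is assumed to be a dict loaded from the aligner's JSON output.
    min_cps / max_cps are hypothetical characters-per-second bounds.
    """
    suspects = []
    for frag in sync_map["fragments"]:
        text = " ".join(frag["lines"])
        duration = float(frag["end"]) - float(frag["begin"])
        if duration <= 0:
            suspects.append((frag["id"], "zero or negative duration"))
            continue
        cps = len(text) / duration
        if cps < min_cps or cps > max_cps:
            suspects.append((frag["id"], f"{cps:.1f} chars/sec"))
    return suspects


# Made-up example data in the assumed format.
example_map = {
    "fragments": [
        {"id": "f000001", "begin": "0.000", "end": "2.000",
         "lines": ["Hello there, how are you?"]},
        {"id": "f000002", "begin": "2.000", "end": "2.200",
         "lines": ["This sentence is far too long for its slot."]},
        {"id": "f000003", "begin": "2.200", "end": "2.200",
         "lines": ["Empty."]},
    ]
}

suspect_ids = [frag_id for frag_id, _ in flag_suspect_fragments(example_map)]
```

It won't catch everything (a fragment can have a plausible rate and still be misaligned), so it only prioritises what to listen to first rather than replacing the listening.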
An example I came across recently, similar to your point 3, was that dates can be read in different styles. Even though I had a good alignment in general, there was sometimes no way to tell from the text how the speaker had chosen to say a date: in my case they sometimes said it in an American way (e.g. May 4th 2004 as “May fourth, two thousand four”) and at other times in a more English manner (“May the fourth, two thousand and four”). My solution was to use grep to identify samples with dates, then listen to a load of them manually and adjust the text wherever the spoken words were different from what my text normalisation had produced. As a compromise I went for the easiest cases only, and I may refine the dataset further in this manner later, now that I know it gets passable results.
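If you'd rather do the grep step in a script, something along these lines would work. It's a minimal sketch, not what I actually ran: it assumes an LJSpeech-style metadata file with one `id|text` entry per line, and the date patterns and the `samples_with_dates` name are just illustrative, so extend the regex for whatever date formats your transcripts really contain.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Three illustrative patterns: "May 4th 2004", "4th May 2004", "04/05/2004".
DATE_RE = re.compile(
    rf"\b(?:{MONTHS})\s+\d{{1,2}}(?:st|nd|rd|th)?,?\s+\d{{4}}\b"
    rf"|\b\d{{1,2}}(?:st|nd|rd|th)?\s+(?:{MONTHS})\s+\d{{4}}\b"
    r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b"
)


def samples_with_dates(lines):
    """Return (sample_id, matched_date) for each 'id|text' line containing a date."""
    hits = []
    for line in lines:
        sample_id, _, text = line.partition("|")
        match = DATE_RE.search(text)
        if match:
            hits.append((sample_id, match.group(0)))
    return hits


# Made-up metadata lines in the assumed "id|text" format.
example_lines = [
    "s001|We met on May 4th 2004 in London.",
    "s002|You may take four apples.",
    "s003|The form was due 04/05/2004.",
]

date_hits = samples_with_dates(example_lines)
```

The list of ids it returns is the set of clips to listen to; the script can't tell you which reading the speaker used, only where to check.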
A guide I found helpful on this topic is here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The stated aim there is a speech recognition dataset, but if your audio is of good enough quality the same approach can work well for preparing a TTS training set. Bear in mind that it will be harder (possibly impossible) to get the best quality voice models from “found” audio, although it’s clearly a more accessible option than hiring professional voice talent.