Tacotron2 can achieve impressive results, and benchmarks on LJSpeech do not really show this. With my dataset, which is far from TTS-oriented but has no background noise and completely matching transcriptions, I am able to synthesise speech of up to 5000 characters with minimal to no errors. My goal here is to make my TTS sound as natural as I can.
The secret to not being overwhelmed is to take it slow and try everything.