I have a 14,000-sentence dataset (clean audio, single speaker, correctly transcribed) that I’m training a model on. Alignment was solid by 10k steps, and I’m now just past 90k. For the most part it’s sounding good.
Words containing “ah” or ending in a long ‘a’ tend to come out with a weird rolled-r sound tacked on (like pirate speak, but unwanted). I’ve listened to those sentences in the dataset, and I added a few of them to the test sentences, including words that are spoken correctly in the source audio, and they still come out as “ar” when generated. “Athena sprang from the head of Zeus” ends up sounding like “Arthenar sprang…”
I should also add that I’ve trained with the same config parameters (everything but the dataset) on LJ and didn’t have this issue.
Should I start over? Adjust the files we’re using in the dataset? Add even more sentences with the correct pronunciations? Everything else seems to be good, even extremely long sentences.
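Before adding sentences, it might be worth checking how often the trouble sounds actually occur in the transcripts, so you can see whether they're underrepresented. A minimal sketch (the patterns are rough orthographic guesses for "ah"/word-final long 'a' rather than real phoneme matching, and the LJSpeech-style `metadata.csv` path and pipe-separated format are assumptions):

```python
import re
from collections import Counter

def count_trouble_words(transcripts,
                        patterns=(r"\b\w*ah\w*\b",   # 'ah' anywhere in a word
                                  r"\b\w+a\b")):     # word-final 'a'
    """Count words matching rough orthographic patterns for the problem
    sounds. A phonemizer would be more precise than regexes, but this
    gives a quick sense of coverage in the dataset."""
    counts = Counter()
    for text in transcripts:
        for pat in patterns:
            counts.update(re.findall(pat, text.lower()))
    return counts

# Hypothetical usage over an LJSpeech-style metadata.csv
# ("id|raw text|normalized text", pipe-separated):
# with open("metadata.csv", encoding="utf-8") as f:
#     transcripts = [line.split("|")[-1] for line in f]
transcripts = ["Athena sprang from the head of Zeus"]
for word, n in count_trouble_words(transcripts).most_common():
    print(n, word)
```

If the counts are low relative to 14k sentences, that would support adding more examples; if they're already plentiful, the problem is more likely elsewhere (normalization, phoneme config, or a few mislabeled clips).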