One phoneme's pronunciation not matching dataset

I have a 14,000-sentence dataset (clean audio, single speaker, correctly transcribed) that I’m trying to train a model from. It aligned by 10k steps, and I’m now just past 90k. For the most part it’s sounding good.

Words with “ah” or ending with a long ‘a’ tend to pick up a weird rolled-r sound afterwards (it’s like pirate speak, but unwanted). I’ve listened to the relevant sentences in the dataset and added a few to the test sentences, including words spoken correctly in the source, and they still come out as “ar” when generated. “Athena sprang from the head of Zeus” ends up sounding like “Arthenar sprang…”

I should also add that I’ve trained with the same config parameters (everything other than the dataset) on LJ and not had this issue.

Should I start over? Adjust the files we’re using in the dataset? Add even more sentences with the correct pronunciations? Everything else seems good, even extremely long sentences.


Hi, what language is the dataset? Is it English? If English, what dialect?

English, American, I’d call it General American English.

Weird that it does that. Does it happen everywhere in the recordings? What do you get if you phonemize a sentence that is synthesized wrong?

Try running echo "hello world" | phonemize -l en-us -b espeak


Yes, @georroussos’s method seems like the best way to determine if there’s anything odd going on with the phonemes produced for those test sentences. Confirming good phonemes for some of your more common words in the training set that start or end in “a” would also be worth doing just to double check no odd behaviour on that side too.

@baconator - I assume your config is definitely set to train with phonemes? (not just direct from letters)


Phonemize output looks correct, and yes, using phonemes.

Going to restart from scratch, see how it goes.

Pulled down the current commit of the repo, which has the vocab/phoneme section commented out, and will try it with that. Also, yes, I’ve been clearing the phoneme cache between runs.
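For anyone following along, clearing the cache is just a matter of deleting the generated .npy files before retraining. A minimal sketch, assuming a hypothetical cache directory (the actual path depends on your config):

```shell
# Hypothetical cache location for demonstration; substitute the
# phoneme cache path from your own config.
CACHE_DIR=/tmp/phoneme_cache_demo

# Simulate a populated cache
mkdir -p "$CACHE_DIR"
touch "$CACHE_DIR/LJ001-0001.npy" "$CACHE_DIR/LJ001-0002.npy"

# Delete every cached phoneme file so the next run regenerates them
find "$CACHE_DIR" -name '*.npy' -delete

# Directory should now be empty
ls "$CACHE_DIR"
```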

ETA for anyone from the future:
Using the updated repo and config file, I tried first with LJ (worked as expected) and then my private dataset, keeping the LJ phonemes. This seems to work so far (100k).


Glad it appears to be working @baconator

I suspect that keeping the LJ phoneme files wouldn’t have had an effect. The caching process saves them with filenames corresponding to the original wav filename but with an .npy extension.

If the filenames from LJ Speech had overlapped with those from your dataset, it would likely have messed things up (the trainer would read them in, assuming they were the phonemes for the equivalent audio, but would get completely unrelated phonemes).
If the filenames differed, the entries would simply coexist in the cache directory, and when processing your dataset it wouldn’t read the ones for the LJ Speech filenames.
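The collision logic above can be sketched in a few lines. This is a minimal illustration, not the trainer’s actual code; the `cache_path` helper and the `phoneme_cache` directory name are assumptions standing in for whatever the repo does internally:

```python
from pathlib import Path

def cache_path(wav_path: str, cache_dir: str) -> Path:
    """Hypothetical helper mirroring the scheme described above:
    the cached phoneme file is named after the original wav file,
    with the extension swapped to .npy."""
    return Path(cache_dir) / (Path(wav_path).stem + ".npy")

# Two datasets whose wav filenames overlap map to the SAME cache file,
# so one dataset would silently read the other's phonemes.
a = cache_path("LJSpeech/wavs/LJ001-0001.wav", "phoneme_cache")
b = cache_path("mydata/wavs/LJ001-0001.wav", "phoneme_cache")
print(a == b)  # True: collision

# Distinct filenames coexist harmlessly; the unrelated LJ entries
# are simply never read when processing the other dataset.
c = cache_path("mydata/wavs/utt_0001.wav", "phoneme_cache")
print(a == c)  # False: no collision
```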

Hm. The updated repo was the only other thing I changed. Interesting.