Following on from the issue here, I thought I’d upload two files to show a few things I’m running into (and my attempts to figure out a way round them).
I’ve been using espeak-ng (rather than espeak, given that espeak is no longer maintained) and have manually added a few custom words to it for words that commonly appear in my training material (so as not to confuse the model during training with audio that doesn’t correspond well to the phonemes provided).
I mostly get good results, but there are some areas where the model has not learnt well - sometimes this is down to espeak-ng quirks, but several times it’s the model being inconsistent. An example of the inconsistency with one of my better models is with “make”: at the start of short sentences it’s clearly said as I would say it, but in longer sentences it takes on a weird “maeek” kind of twist. That seems like something I need to fix, perhaps by adding more training example audio (and verifying the quality of the samples that include “make”, just in case there’s an issue with them).
I started looking at how rhyming sets of words sounded, to see whether that would distinguish flaws in the phonemes for particular words from flaws affecting all words containing a given phoneme. I’ve found a few problem words that way, but for the most part it seems that one or two particular phonemes just don’t get learnt properly - I still have more investigating to do to pin this down (but again it seems likely I’ll need to look at the training audio). An example in the attached output audio file is the line near the beginning, “push, bush, cush, tush.”, which generates these phonemes from espeak-ng: pˈʊʃ, bˈʊʃ, kˈʊʃ, tˈʊʃ - and yet they all sound malformed.
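For what it’s worth, this is roughly how I’m pulling the phonemes out to eyeball a rhyming set - just a minimal sketch calling espeak-ng from Python (I believe the UK-RP voice is called en-gb-x-rp, but the voice name may differ on your install):

```python
import subprocess

# Rough sketch: ask espeak-ng for the IPA phonemes of each word so a rhyming
# set can be compared by eye. "en-gb-x-rp" is what I believe the UK-RP voice
# is called; adjust to match your install.
def ipa(word, voice="en-gb-x-rp"):
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, word],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for w in ["push", "bush", "cush", "tush"]:
    print(w, "->", ipa(w))
```

Running that over a few rhyming sets makes it quick to see whether espeak-ng itself is producing something odd or whether the problem is on the model side.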
One area where I did see espeak-ng itself have problems was with a whole class of long “a” sounds. I’m using it with the UK-RP accent setting and it does make an adjustment (eg cast -> c-AH-st), but it over-applies this, so you end up with pl-AH-stic, dr-AH-stic etc (which no-one says except for comedy value!). There is probably a way to fix this collectively in espeak-ng’s rules, but it seemed fiddly, so I ended up simply overriding each of the problem words (roughly as in the sketch below).
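In case it helps anyone, the overrides are just extra dictionary entries that get compiled in - something along these lines (the phoneme spellings here are illustrative rather than exactly what I used):

```
// dictsource/en_extra - extra word entries (phoneme spellings are illustrative)
plastic    pl'astIk
drastic    dr'astIk
// then rebuild from the dictsource directory with: espeak-ng --compile=en
```

Entries in the list/extra files should take precedence over the letter-to-sound rules once the dictionary is recompiled.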
Lastly, I’ve included some initial work on heteronyms that I’ve been playing around with. espeak-ng handles these to an extent (eg adjusting “read” to the correct tense in some cases) but it wasn’t quite consistent enough for my liking, so I looked into approaches to handle them separately. My rather simplistic attempt uses a language model (KenLM) over a corpus where I’ve manually marked the distinct word uses (appending “|1”, “|2” etc for the distinct forms). Whilst some of the heteronyms could have been handled by POS analysis, I found this wasn’t that reliable, and there are enough cases where POS alone can’t distinguish them (ie where both variants are, say, a noun), so the language model seems better able to help in such scenarios. As I say, it’s not that sophisticated, so I expect smarter methods could easily out-perform it.
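To give a flavour of the tagged data and how the model gets built - the tag numbering, file names and n-gram order here are just placeholders, and the KenLM invocation is the standard one rather than exactly what I ran:

```
# manually tagged lines ("read|1" = present tense, "read|2" = past tense - my own numbering)
she will read|1 the report tomorrow
he read|2 the report yesterday

# build an n-gram model with KenLM's command-line tools
lmplz -o 4 < tagged_corpus.txt > heteronyms.arpa
build_binary heteronyms.arpa heteronyms.bin
```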
The way it works for a heteronym with two variants is to generate two versions of a sentence (with the heteronym tagged with the corresponding |1 or |2 in each version), and then the LM simply gives each sentence a probability, with the more probable one being used to overwrite the phonemes that espeak-ng gives. That overwriting step could also serve as a more general correction method for bad phonemes, but the issue I see with it is the fiddly nature of deploying this (I don’t mind messing with a compiled espeak-ng dictionary, but it’s not viable if people are going to install this smoothly with pip!)
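The selection step itself is only a few lines; roughly this, assuming KenLM’s Python module and a model built as above (names and paths are placeholders):

```python
import kenlm

# n-gram model trained on the manually tagged corpus (path is a placeholder)
model = kenlm.Model("heteronyms.bin")

def pick_variant(sentence, heteronym, n_variants=2):
    """Score each tagged version of the sentence and return the winning tag.

    Rough sketch: each candidate replaces the plain heteronym with word|1,
    word|2, ... and the highest-scoring sentence wins.
    """
    best_tag, best_score = None, float("-inf")
    for i in range(1, n_variants + 1):
        tagged = sentence.replace(heteronym, f"{heteronym}|{i}")
        score = model.score(tagged, bos=True, eos=True)  # log10 probability
        if score > best_score:
            best_tag, best_score = f"{heteronym}|{i}", score
    return best_tag

print(pick_variant("she read the report yesterday", "read"))
```

The winning tag then just selects which variant’s phonemes get written over what espeak-ng produced for that word.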
audio1.zip (3.7 MB) Initial_demo_of_heteronym_examples_06Jun2020.zip (3.1 MB)