Front-end / Phoneme discussions

Following on from the issue here, I thought I’d upload two files to show a few things I’m running into (and my attempts to figure out a way round them).

I’ve been using espeak-ng (rather than espeak, given it’s not maintained) and have manually added a few custom entries to it for words that commonly appear in my training material (so as not to confuse the model by training it on audio that doesn’t correspond well to the phonemes provided).

I mostly get good results but there are some areas where the model has not learnt well - sometimes this is down to espeak-ng quirks but several times it’s the model being inconsistent. An example of the inconsistency with one of my better models is with “make”: at the start of short sentences, it’s clearly said as I would say it but for longer sentences it takes on a weird “maeek” kind of twist. That seems like something I need to fix with perhaps adding more training example audio (and verifying the quality of those samples which include make, just in case there’s an issue).

I started looking at how rhyming sets of words sounded to see if that distinguished flaws in the phonemes for particular words or for all words with a phoneme. I’ve found a few problem words that way but for the most part it seems to be that one or two particular phonemes just don’t get learnt properly - I still have more investigation to pin this down (but again it seems likely I’ll need to look at the training audio). An example in the output audio file attached is with the line near the beginning for “push, bush, cush, tush.” which generate these phonemes from espeak-ng: pˈʊʃ,bˈʊʃ,kˈʊʃ,tˈʊʃ and yet they all sound malformed.
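One way to organise this kind of rhyme check is to group words by the part of the phoneme string that follows the primary stress mark, so that each group exercises one vowel-plus-ending. This is only a minimal sketch, assuming single-syllable words and espeak-ng style IPA with `ˈ` before the stressed vowel; the function name `group_by_rime` is made up for illustration.

```python
from collections import defaultdict

def group_by_rime(phonemes_by_word):
    """Group words whose phonemes match from the primary stress mark onwards."""
    groups = defaultdict(list)
    for word, phonemes in phonemes_by_word.items():
        # Keep everything after the last stress mark; if there is no
        # mark, rpartition leaves the whole string in the final slot.
        _, _, rime = phonemes.rpartition("ˈ")
        groups[rime].append(word)
    return dict(groups)

# Phonemes as reported by espeak-ng in the post above.
sample = {"push": "pˈʊʃ", "bush": "bˈʊʃ", "cush": "kˈʊʃ", "tush": "tˈʊʃ"}
rhyme_sets = group_by_rime(sample)
```

If every word in a group sounds wrong, the shared phonemes are the likely culprit; if only one does, that word (or its training audio) is the better suspect.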

One area where I did see espeak-ng have problems was with a whole class of long “a” sounds. I’m using it with the UK-RP accent setting and it does make an adjustment (eg cast -> c-AH-st), but it over-applies this and you end up with pl-AH-stic, dr-AH-stic etc (which no-one says except for comedy value!). There probably is a way to fix this collectively in espeak-ng but it seemed fiddly so I ended up simply overriding each of the words with issues.

Lastly I’ve included some initial work with heteronyms I’ve been playing around with. Espeak-ng handles these to an extent (eg adjusting read to the correct tense in some cases) but it wasn’t quite consistent enough for my liking, so I looked into approaches to handle them separately. My rather simplistic attempt is to use a language model (KenLM) where I’ve manually marked the distinct word uses (appending “|1”, “|2” etc for the distinct forms). Whilst some of the heteronyms could have been handled by POS analysis I found this wasn’t that reliable and there are enough where the POS can’t distinguish alone (ie where both variants are say a noun) so the language model seems better able to help in such scenarios. As I say, it’s not that sophisticated so I expect smarter methods could easily out-perform it.

The way it works for a heteronym where there are two variants is to generate two versions of a sentence (with the heteronym word in the sentence tagged with the relevant |1 or |2 in each corresponding sentence) and then the LM simply gives the sentence probability, with the more probable one being used to overwrite the phonemes that espeak-ng gives. That overwriting step could also serve as a more general correction method for bad phonemes, but the issue I see with it is the fiddly nature of deploying this (I don’t mind messing with a compiled espeak-ng dictionary but it’s not viable if people are going to install this smoothly with pip!)

Hi @nmstoker, interesting research you have been doing here - fortunately, German does not have such “heteronyms” afaik :slight_smile: But I find your LM approach very intriguing: so how exactly do you mark your sentences with “|1” or “|2”, manually in advance? And is KenLM then another preprocessing step in training/inference, where, after espeak-ng has processed the sentence, it replaces the phoneme representation for words marked “|1” etc. with the most likely version according to its LM?

Following your extensive tutorial on espeak-ng handling here in the discourse forum, I have assembled a special espeak-ng dictionary for German loan words (mostly from French or English), around 10k words from wiktionary, where the phoneme pattern is simply overwritten. This works quite well. And training - as you also mentioned - is faster and more stable! It also executes quite quickly, so except for highly time-constrained web services or the like I think it’s a great solution for TTS preprocessing. Thanks again for sharing your insights!

Thx for sharing your work. Yes the front-end part is a bit complicated especially for languages like English. Some alternative solutions might be:

  • Using POS tags as additional features to the TTS model, so that the model can learn the positional pronunciation differences if they are available in the dataset. This has been done in a couple of different TTS papers before.

  • Using traditional statistical TTS methods may also be a solution. Since they work on phoneme-aligned datasets you have more control over the whole system, but then it mainly relies on the front-end, so everything there should work perfectly. This is also probably why most of the commercial providers use such an approach. The downside is that these models do not sound as natural.

  • The language model solution sounds good to me as well. But, as you figured before, as you replace components of the system with models you lose control over the fine details.

  • The best solution (in my opinion) is to use a well-curated text corpus to create a TTS dataset. Then the model handles all of these by itself.

@repodiac glad the German training’s going well.

Regarding the heteronyms you’re right: the KenLM is a step in my preprocessing (the last one).

The method is probably easiest to see in a screenshot:

Taking the input sentence, it creates permutations of each variant of each heteronym present and runs them through KenLM. In the case above the word “does” is found three times in the sentence, so each of those is tried with the three heteronyms for “does” (most commonly the verb, like “he does”, but also as in “hair does”/“fancy does” and lastly as in deer). That creates 27 permutations to test in the KenLM model, and it reckons here that variant 8 (1, 3, 2) is most likely (-32.2 is the highest score), so it picks that and then replaces the first instance with the phonemes for 1 (ˈdʌz), the second with those for 3 (ˈduːz) and the third with those for 2 (ˈdoʊz). It’s not perfect but works reasonably - you generally want a decent amount of data for each heteronym word, and it has to cover a range of different word styles without letting one variant of the heteronym get too massively outnumbered.
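The permutation step described above can be sketched roughly as follows. This is not the actual code from the thread: `candidate_taggings`, `best_tagging` and the `variant_counts` lookup are hypothetical names, and the scorer is passed in as a plain callable. With the KenLM Python bindings that callable would be something like `kenlm.Model(path).score`.

```python
from itertools import product

def candidate_taggings(words, variant_counts):
    """Yield (tagged_sentence, choice) for every combination of variants.

    words          -- the tokenised sentence
    variant_counts -- hypothetical lookup: heteronym -> number of variants
    """
    # Positions of heteronyms, with the variant numbers available at each.
    slots = [(i, range(1, variant_counts[w] + 1))
             for i, w in enumerate(words) if w in variant_counts]
    for choice in product(*(options for _, options in slots)):
        tagged = list(words)
        for (i, _), variant in zip(slots, choice):
            tagged[i] = f"{words[i]}|{variant}"
        yield " ".join(tagged), choice

def best_tagging(words, variant_counts, score):
    """Return the (tagged_sentence, choice) pair the scorer rates highest."""
    return max(candidate_taggings(words, variant_counts),
               key=lambda candidate: score(candidate[0]))
```

Three occurrences of a three-variant heteronym give 3 × 3 × 3 = 27 candidates, and when they are enumerated in this order the choice (1, 3, 2) is the 8th, which lines up with the numbering in the example above.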

@erogol - not sure if you saw but there’s a DeepMind paper out today (blog post / Arxiv) and rather helpfully they give a bit of detail on their phoneme handling in the Appendix (Appendix E and Table 3):

Appendix E Text preprocessing

We use phonemizer [7] (version 2.2) to perform partial normalisation and phonemisation of the input text (for all our results except for the No Phonemes ablation, where we use character sequences as input directly). We used the espeak backend (with espeak-ng version 1.50), which produces phoneme sequences using the International Phonetic Alphabet (IPA). We enabled the following options that phonemizer provides:

• with_stress, which includes primary and secondary stress marks in the output;
• strip, which removes spurious whitespace;
• preserve_punctuation, which ensures that punctuation is left unchanged. This is important because punctuation can meaningfully affect prosody.

The phoneme sequences produced by phonemizer contain some rare symbols (usually in non-English words), which we replace with more frequent symbols. The substitutions we perform are listed in Table 3. This results in a set of 51 distinct symbols.

I realise that just because they chose that path doesn’t mean it’s the best (:slightly_smiling_face:), but it’s interesting to see that with all their resources they stuck with espeak-ng & phonemizer. Maybe for some languages it’ll be less of an issue but I do think English is pretty awkward and the rules for edge cases can be hard to infer from text alone.
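For reference, the three options quoted from the appendix correspond to keyword arguments of phonemizer’s `phonemize` function. A minimal sketch, assuming phonemizer and espeak-ng are installed; the `en-gb` language code and the wrapper name `to_phonemes` are assumptions for illustration, not something the paper specifies.

```python
# Options named in the appendix, passed through to phonemizer's espeak backend.
PHONEMIZER_OPTIONS = {
    "backend": "espeak",
    "with_stress": True,           # keep primary/secondary stress marks
    "strip": True,                 # drop spurious whitespace
    "preserve_punctuation": True,  # leave punctuation in the output
}

def to_phonemes(text, language="en-gb"):
    """Phonemise text with the option set above (language is an assumption)."""
    # Imported lazily so this module still loads where phonemizer is absent.
    from phonemizer import phonemize
    return phonemize(text, language=language, **PHONEMIZER_OPTIONS)
```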

I’ll give it some more thought but in the short-term I plan to look a bit more at the phoneme output for my training data as I’d started already and see how my audio samples are for the less strong cases; I may have a go with the kind of mapping of the rare symbols they mention too to see what benefits I can gain.
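A rough sketch of how that rare-symbol check might look: count how often each symbol appears across the phonemised training text, list those below a threshold, and apply a substitution table. The function names and the threshold are hypothetical, and the mapping in the usage line is a placeholder, not the contents of Table 3. It also treats each character as one symbol, which is a simplification (a long vowel such as `uː` is two characters).

```python
from collections import Counter

def rare_symbols(phoneme_lines, min_count):
    """Return {symbol: count} for symbols seen fewer than min_count times."""
    counts = Counter(ch for line in phoneme_lines
                     for ch in line if not ch.isspace())
    return {sym: n for sym, n in counts.items() if n < min_count}

def substitute(phoneme_line, mapping):
    """Replace single-character symbols according to mapping."""
    # str.maketrans needs single-character keys; values may be longer.
    return phoneme_line.translate(str.maketrans(mapping))

# Placeholder mapping purely for illustration (NOT taken from the paper).
example = substitute("xˈʊʃ", {"x": "k"})
```

Running `rare_symbols` over the phonemes for the whole training set first gives a list to review by hand before deciding on any substitutions.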

Wow - fully automatic! Impressive. I am not familiar with KenLM, but I still don’t understand (probably the source code would help but I don’t want to bother you) how KenLM knows what your suffix marks (e.g. |1) mean? And how do you come up with the pattern, which words have to be marked up?

When you have filtered the most probable setting/permutation (according to KenLM’s score), where do you get the CORRECT phoneme pattern for the heteronym from and how do you find the precise position in the espeak-ng phoneme translation for replacing the original (i.e. maybe wrong) phoneme with the phoneme from KenLM’s highest scored permutation?

But really impressive - btw. do you do this as a professional or is it your freetime activity? :smiley:

Oh, and really thanks for the pointer! Sounds good :sunglasses:

Hi @repodiac - it’s done with some Python, it’s not part of KenLM, I merely call KenLM to get the sentence score. I’ve got a big list of heteronyms where I gathered a definition and phonemes manually for each variant, so that’s used to display the definition during “user testing” and to supply the phonemes that get swapped in. The code is a bit of a mess but it works okay.
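To make the idea concrete, here is a minimal sketch of what such a heteronym list and the swap step could look like. It is not the actual code: the names `Variant`, `HETERONYMS` and `swap_phonemes` are invented, the definitions are paraphrased, and it assumes the phonemes are available per word and aligned one-to-one with the tokens (for example by phonemising word by word), which the thread doesn’t confirm. The three phoneme strings for “does” are the ones given earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    definition: str  # shown to the user when checking results
    phonemes: str    # swapped in over the espeak-ng output

# Hypothetical structure; variant numbers follow the "does" example above.
HETERONYMS = {
    "does": {
        1: Variant("verb, as in 'he does'", "ˈdʌz"),
        2: Variant("plural of doe (deer)", "ˈdoʊz"),
        3: Variant("plural of do, as in 'hair does'", "ˈduːz"),
    },
}

def swap_phonemes(words, word_phonemes, choices):
    """Overwrite per-word phonemes for the chosen heteronym variants.

    words         -- tokenised sentence
    word_phonemes -- phonemes per word, aligned one-to-one with words
    choices       -- {word index: variant number} picked by the LM step
    """
    out = list(word_phonemes)  # copy so the caller's list is untouched
    for index, variant in choices.items():
        out[index] = HETERONYMS[words[index]][variant].phonemes
    return out
```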

Okay, but how does KenLM know that you specify different meanings of a heteronym, for instance: How does it know that does|1 is different from does|2? I also would expect that these suffixes are not part of the input to KenLM?

I have fed KenLM a file I produced where the words do indeed have |1 or |2 etc tagged on the end. Thus it will start to view sentences with similar context to those for |1 when the word is tagged |1 as more probable than if that same context was present but the word was tagged |2.
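A toy illustration of why this works: to the language model, `does|1` and `does|2` are simply two different vocabulary items, so each picks up its own context statistics. The sketch below uses a tiny add-one-smoothed bigram model as a stand-in for KenLM (which uses a more sophisticated smoothing scheme); the corpus and the function names are invented for the example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigram contexts and bigrams over whitespace-split tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # tokens used as context
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {tok for s in sentences for tok in s.split()} | {"</s>"}
    return unigrams, bigrams, len(vocab)

def log_score(sentence, model):
    """Log10 probability of a sentence under add-one smoothing."""
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log10((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

# Invented training lines with the heteronym tagged by hand.
corpus = [
    "he does|1 his homework",
    "she does|1 the work",
    "the does|2 ran across the field",
    "two does|2 ran away",
]
model = train_bigram(corpus)
verb_score = log_score("he does|1 the work", model)
deer_score = log_score("he does|2 the work", model)
```

Here `verb_score` comes out higher than `deer_score`, because “he does|1” and “does|1 the” were seen in training while the `|2` equivalents were not.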

I see. So this still requires a significant amount of (manual) labour to come up with the right tags for each heteronym AND a “right” distribution of examples of those for training the KenLM language model, I suppose!?