Gruut discussions

Following on from here: Creating a github page for hosting community trained models

@synesthesiam - thanks a lot for those details. I will certainly have a go with gruut.

It sounds like we’ve got some similar aspirations around pronunciation!

Regarding the handling of read, I’d experimented earlier in the year with a simple(ish) language model approach that could generally infer the most applicable pronunciation for heteronyms from their use in context. It is hardly production ready, but there are a few details in the thread below. The data I’d gathered is by no means complete, but it demonstrates reasonable effectiveness. The manual setup is a clear downside, but the upside is that it doesn’t need guidance with the input text; it will usually make a pretty reasonable guess.

Generally my focus is on British English / RP, so it’d be interesting to see how easily gruut could be adapted for that. I saw you had a way to add accents (e.g. French), but with so many different pronunciation differences, I’m wondering if a distinct en-UK dictionary might make sense (although clearly I’d need to figure out a way to source it!)


One option for heteronyms is to provide word embeddings or some kind of semantic feature as an addition to the characters. Alternatively, an LM can translate the raw phonemes into context-specific ones. I’ve not tried either of these, but I saw the first one in papers and the latter is just my intuition.
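To make the first option concrete, here’s a toy sketch of feeding a word-level semantic vector alongside the characters (the 4-dim “embeddings” below are invented for illustration, not from any real model):

```python
# Toy illustration: augment per-character inputs with a word-level
# semantic feature so a model can separate heteronym senses.
# The 4-dim "embeddings" below are invented for the example.
WORD_VECS = {
    "read": [0.1, 0.9, 0.0, 0.3],
    "book": [0.8, 0.1, 0.2, 0.0],
}

def char_features(word, vocab="abcdefghijklmnopqrstuvwxyz"):
    """One-hot encode each character, concatenated with the word's vector."""
    vec = WORD_VECS.get(word, [0.0] * 4)
    feats = []
    for ch in word:
        one_hot = [1.0 if c == ch else 0.0 for c in vocab]
        feats.append(one_hot + vec)
    return feats

feats = char_features("read")
print(len(feats), len(feats[0]))  # 4 characters, 26 + 4 features each
```

In a real system the word vectors would come from a pretrained embedding model rather than a hand-written table.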

I am also open to exploring Gruut, but since it only targets a limited set of languages, I don’t think it’d replace phonemizer. Maybe we could offer it as an alternative if someone wants to put it inside the TTS.


I thought your username looked familiar, @nmstoker. I actually came across your post about using KenLM to predict word pronunciations a while back! It was part of the inspiration for me to create gruut :slight_smile:

Do you think forced alignment could be used to help avoid having to manually prepare the data you provide to KenLM? With audiobooks especially, I’d expect to be able to get some useful statistics about pronunciations in context. @erogol, do you have any other ideas of where to get this kind of data? Did the papers with word embeddings cite some specific data source?

Another option might be for me to add a phonemizer-compatible CLI around gruut.

But gruut also does text cleaning and tokenization – something I had hoped to separate out from MozillaTTS into a specialized package. I originally started with spaCy, but found it too restrictive with contractions and possessives.

Here’s one I came across:

The “accents” I can do with gruut are inspired by the IPA charts for various English locales. I just have an IPA map between languages for sounds that they don’t share, but which can be approximated. The same technique could be used with phonemizer, if you have a good idea of the underlying phoneme inventory of both languages.
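As a rough illustration of that mapping idea (the inventory and substitutions below are invented for the example, not gruut’s actual accent tables):

```python
# A minimal sketch of cross-language phoneme mapping: phonemes the
# target language lacks are replaced with their closest approximation.
# These French approximations of English sounds are illustrative only.
FR_APPROXIMATIONS = {
    "θ": "s",   # "think" -> s-like sound
    "ð": "z",   # "this"  -> z-like sound
    "ɹ": "ʁ",   # English r -> French uvular r
    "h": "",    # French drops /h/
}

def map_phonemes(phonemes, table):
    """Replace phonemes absent from the target inventory; drop empties."""
    out = []
    for p in phonemes:
        mapped = table.get(p, p)
        if mapped:
            out.append(mapped)
    return out

print(map_phonemes(["h", "ɛ", "l", "oʊ"], FR_APPROXIMATIONS))
# -> ['ɛ', 'l', 'oʊ']
```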

What I’m really hoping to do with the “accent” idea is to apply it to speech recognition. If someone is a native French speaker, but is speaking U.S. English, I should be able to use a French acoustic model + a U.S. English lexicon with the phonemes mapped. No need to have corpora for all combinations of languages and accents :slight_smile:

To understand better: you basically want to train a French model and use it for English with the right IPA mapping. Is this correct?

What do you mean exactly by “useful statistics”?

Correct. The idea (maybe incorrect) is that a French native speaker is approximating English sounds, so a French acoustic model should be more useful than an English one.

Frequencies of which heteronym is used in a particular context. The context could be surrounding words, part of speech, or (probably best) a word vector as you suggested. But we’d still need a ground truth dataset where words with multiple pronunciations have been disambiguated.

On the heteronym point, picking up ideas @synesthesiam mentioned, I was wondering if some form of Word Sense Disambiguation might be a start: establish the words that could be said differently, and then, by analysing a corpus of force-aligned audio/text, pick out the audio for those words and detect the sets of pronunciations the different senses follow.

Then for inference, so long as we could determine the word sense in the same way, the relevant pronunciation could be looked up.

In the analysis stage, for e.g. “read”, it would find a whole load of audio samples where the text had “read” present, and some of those would be said to sound like reed and some like red.

Of course, the tough part here is finding a good way to disambiguate the usage! Often POS is a start, because there are sometimes differences between verb and noun usage, but I found various cases where that isn’t enough (or where POS categorisation didn’t work reliably). What I did with KenLM was simply put the sentences in with distinct tokens (e.g. read_1 and read_2) for the different pronunciation cases. Then, when trying to figure out which sense is best, you feed your input sentence in with both forms of the token (once with _1 and once with _2) and compare the probability KenLM reports for each sentence. I’m sure there are more sophisticated, robust ways!
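For what it’s worth, the comparison step can be sketched like this. With the real kenlm Python module the scorer would be `Model.score`; here a toy scoring function stands in so the logic is self-contained:

```python
# Sketch of the KenLM disambiguation trick. With the real library:
#   import kenlm
#   model = kenlm.Model("heteronyms.arpa")
#   score = model.score(sentence, bos=True, eos=True)
# Here score_fn stands in for model.score.

def pick_pronunciation(sentence, word, variants, score_fn):
    """Substitute each variant token (e.g. read_1, read_2) into the
    sentence and keep the one with the highest (log) probability."""
    return max(variants, key=lambda v: score_fn(sentence.replace(word, v)))

# Toy scorer: pretends the LM has seen "had read_2" and "will read_1".
def toy_score(sentence):
    if "had read_2" in sentence or "will read_1" in sentence:
        return 0.0
    return -5.0

print(pick_pronunciation("she had read the book", "read",
                         ["read_1", "read_2"], toy_score))
# -> read_2
```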


I had wondered about using something like Allosaurus for the above approach, as it outputs phonemes directly, thus theoretically allowing distinction between the heteronym variants, but I’ve found it to be insufficiently accurate.

Maybe a semi-manual approach whereby the heteronym audio is clustered would work: you take all the “read” audio, cluster it into two groups, and then the user would simply need to label each cluster. That applies the label to the text, which gives a range of context examples.
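A minimal sketch of that clustering step, using an invented scalar feature per clip in place of real audio embeddings:

```python
# Toy version of the semi-manual clustering idea: represent each "read"
# clip by a feature (here a made-up scalar, e.g. a vowel formant or
# duration measure), split into two clusters, then a human labels each
# cluster as "reed" or "red".
import random

def two_means(values, iters=20, seed=0):
    """Minimal 2-cluster k-means over scalar features."""
    rng = random.Random(seed)
    c1, c2 = rng.sample(values, 2)
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        if g1:
            c1 = sum(g1) / len(g1)
        if g2:
            c2 = sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

# Hypothetical per-clip features for six "read" clips
features = [0.9, 1.0, 1.1, 0.2, 0.3, 0.25]
reed_like, red_like = two_means(features)
print(reed_like, red_like)
```

In practice the features would be multi-dimensional audio embeddings and you’d use a proper clustering library, but the labelling workflow is the same.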

Allosaurus looks pretty cool; it’s a shame it’s not terribly accurate :frowning:

I train Kaldi speech-to-text models on fairly large datasets, and one of the by-products of training is phonetic alignments for each utterance. It should be straightforward to produce a set of words (which have multiple pronunciations in the lexicon) alongside the utterance text and the specific pronunciation(s) used.

From there, the KenLM approach would just be a matter of replacing “read” with “read_1” or “read_2” in the transcriptions and adding it to a corpus for training. For word vectors, I suppose a classifier would need to be trained.

I’m planning to train a U.S. English Kaldi model in the next few weeks. I’ll see if I can extract this type of data from the results. Do you know of any good UK English speech datasets? I have the ARU and M-AI Labs datasets, but it’s only a few GB.

Gruut might be able to help here too. It can (with some error) convert IPA to eSpeak’s phonemes. Maybe a set of eSpeak WAV files could be generated for heteronym pronunciations, along with sentences containing the words, and those could be presented to participants.

I would need to check how much there is from each narrator but I think there should be some UK English narrators within LibriTTS.

For these purposes it may be safest to pick UK narrators who share the same basic accent features (e.g. Northern and Southern speakers differ in the length of their “a” sounds, amongst other characteristics), but if there’s enough audio from one individual then it’s just a matter of picking one that’s fairly representative.


Quick update: looking at LibriTTS, I see now that there’s (roughly) 30 minutes per speaker at most, and I have struggled to find British English accented speakers so far. I’m taking a look at other sources - it seems like there’s an M-AILABS dataset for “Queen’s English” (i.e. British), so I’m investigating that now.

Update: the M-AILABS dataset for “Queen’s English” might well work - curiously, the speaker isn’t actually English (she’s American, Elizabeth Klett) but she does a good English accent that seems pretty consistent with modern British English pronunciation. Some “give away” words aren’t in the source, because they’re older literary texts. The only minor difference is “often”, which tended to be said more as “off-en” here, but both pronunciations are in use nowadays, so it’s hardly an issue!
The good thing about this is that it’s ~45 hrs of one speaker and a clear recording too, so it could work for training a voice as well as the pronunciation aspects.


I’ve been able to extract phoneme alignments from my existing Kaldi models (not English yet, unfortunately). These are JSONL files, where each line is an utterance with words and the exact phonemes that Kaldi aligned.

My thinking is that we can take these files and train a classifier from word embeddings (or n-grams) to specific heteronym pronunciations.
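As a sketch of the first step (the JSONL schema below is an assumption - adapt the keys to the actual Kaldi export format):

```python
# Sketch: turn per-utterance alignment JSONL into a KenLM training
# corpus with heteronyms relabelled by their aligned pronunciation.
# The schema ("words" carrying each word and its aligned phones) is
# an assumption about the export format, not Kaldi's actual output.
import json

# Known pronunciation variants, keyed by (word, phone string)
VARIANTS = {
    ("read", "r iy d"): "read_1",   # "reed"
    ("read", "r eh d"): "read_2",   # "red"
}

def relabel(jsonl_lines):
    """Yield utterance text with heteronyms replaced by variant tokens."""
    for line in jsonl_lines:
        utt = json.loads(line)
        tokens = []
        for w in utt["words"]:
            key = (w["word"], " ".join(w["phones"]))
            tokens.append(VARIANTS.get(key, w["word"]))
        yield " ".join(tokens)

sample = json.dumps({
    "words": [
        {"word": "she", "phones": ["sh", "iy"]},
        {"word": "read", "phones": ["r", "eh", "d"]},
        {"word": "it", "phones": ["ih", "t"]},
    ]
})
print(list(relabel([sample])))
# -> ['she read_2 it']
```

The relabelled sentences could then go straight into KenLM training, or serve as labels for a word-embedding classifier.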

I’ll take a look at this. I attempted to train a voice with a British speaker speaking the Harvard sentences, but it didn’t work out. I assume it was too little data.

I’ve added en-gb to gruut locally, and will be seeing if I can train a voice using its phonemes. I’ll try a phonemizer-based voice as well once one of my GPUs is free :slight_smile:
