My approach is with espeak-ng - I’m afraid I haven’t used espeak for a while and had abandoned it for espeak-ng by the time I started getting more concerned about pronunciation. A similar approach may well work with espeak too, as they share many command line options (due to their common heritage).
When I talk of compiling the dictionary, it’s not the same as compiling the application - it’s just processing the dictionary file(s) into a format espeak-ng can use internally.
There’s a lot you can pick up from the excellent guide that Josh Meyer put together here: http://jrmeyer.github.io/tts/2016/07/03/How-to-Add-a-Language-to-eSpeak-NG.html (which I believe was then largely used as direct inspiration for some of the espeak-ng documentation)
Summary of steps:
- Find (if it’s already present) or download the source for espeak-ng
- Within espeak-ng/dictsource/ you’ll see the various language dictionary files
- For each language, there are likely to be four files:
– The emoji file - eg en_emoji - you can ignore this
– The rules file - eg en_rules - this sets the general pronunciation rules; for now you can ignore it
– The list of words (mainly ones which are exceptions) - eg en_list - this is interesting to look through for inspiration but it’s usually not necessary to change
– The extras file - eg en_extra - this is the one you’ll usually want to add to; it’s intended for “user”-specific words, ie domain-specific words or those you want to override
Because of the order it looks at the files, it will usually end up following the extras file, so you can mostly just edit this.
For further details on how it works you can look at the espeak-ng docs, but some examples are below. You just add entries as a list, one per line. In my case I’ve grouped them by a few basic types (abbreviations, names of people / places, general words), but that’s not necessary - there’s an excerpt showing this after the examples below.
Example of an abbreviation:
RAC $abbrev $allcaps
This results in it being said letter by letter (only if it’s all in capitals), so “R A C” (not “rack”)
Example of an abbreviation where you spell it out phonetically:
Airbnb 'e@b,i:;nb'i:
Example of a regular word:
boulevard b'u:lEv,A@d
Example of a word based on other words:
Coronavirus corona virus $text
When it sees “Coronavirus” it says it as “corona” followed by “virus” (otherwise it messes up the middle, merging the “avir” part as if they weren’t distinct words).
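Put together, the relevant part of an en_extra file would look something like this (the // lines are comments and the grouping is just my own convention for readability, not anything espeak-ng requires):

```
// abbreviations
RAC         $abbrev $allcaps

// names of people / places
Airbnb      'e@b,i:;nb'i:

// general words
boulevard   b'u:lEv,A@d
Coronavirus corona virus $text
```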
Process
You can iteratively add words to the extra file, compile the dictionary and test it in espeak-ng fairly rapidly if you compile just the language of interest.
Compiling on my laptop takes less than a second for English.
To compile, run this command (from within the dictsource folder):
sudo espeak-ng --compile=en
That yields output like this:
[sudo] password for <username>:
Using phonemetable: 'en'
Compiling: 'en_list'
5458 entries
Compiling: 'en_emoji'
1690 entries
Compiling: 'en_extra'
391 entries
Compiling: 'en_rules'
6743 rules, 103 groups (0)
To test it, I typically use espeak-ng directly, as you get an immediate idea of whether it has used the right phonemes, and then you can test it with TTS more fully once you are done. When I’m testing with RP English, I use this to get an interactive session:
espeak-ng -x -v gmw/en-GB-x-rp
Then you type something (the first line below), you’ll see the phonemes as output (the second line), and you’ll hear it speak the words aloud:
Coronavirus
k@r'oUna# v'aIr@s
If you want it to output IPA characters (which TTS gets via Phonemizer) then add --ipa, thus:
espeak-ng -x --ipa -v gmw/en-GB-x-rp
However, you need to take care not to get mixed up and paste IPA-formatted characters into the extras file - that file uses regular ASCII plus a few symbols that map to the IPA. With the same input as above, Coronavirus yields this:
Coronavirus
kəɹˈəʊnɐ vˈaɪɹəs
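If you end up iterating a lot, the compile-and-check loop is easy to script. Here’s a rough Python sketch of the sort of thing you could use - the dictsource path, voice and word list are placeholders, and the compile step may need sudo as noted above:

```python
import subprocess

# Placeholders - point these at your own espeak-ng checkout, voice and test words.
DICTSOURCE = "espeak-ng/dictsource"
VOICE = "gmw/en-GB-x-rp"
WORDS = ["Coronavirus", "Airbnb", "boulevard", "RAC"]

def compile_dictionary(lang="en"):
    # Recompile just the one language; may need elevated permissions (sudo)
    # if the espeak-ng data directory isn't writable by your user.
    subprocess.run(["espeak-ng", f"--compile={lang}"], cwd=DICTSOURCE, check=True)

def phonemes(word):
    # -q: don't speak aloud, -x: print the phoneme mnemonics to stdout.
    result = subprocess.run(
        ["espeak-ng", "-q", "-x", "-v", VOICE, word],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    compile_dictionary()
    for word in WORDS:
        print(f"{word}: {phonemes(word)}")
```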
Thoughts
I use this mainly for improved pronunciation at inference time, but it should obviously also help to provide the correct IPA for words in your training dataset, rather than giving slightly wrong IPA and then having the model need to learn that it should actually be said in a slightly unexpected way.
How necessary this is will depend on how well espeak-ng does with the words you commonly use. They’ve done a great job overall, and they necessarily have to compromise across the variations that exist within an accent. I found that whilst it was really quite good with RP English, it would get certain words consistently wrong in a way that sounded bad: eg plastic and drastic are short-a words even in RP, so the a is not the same as in class (which in RP has a long a). And it didn’t have a hope with obscure English or British place names!
By using phonemes you get the exact pronunciation desired and not some close approximation based on using regular letters alone.
The downside is that it’s clearly more involved, with a learning curve on the phoneme side. I suppose it could be automated, but a separate dictionary approach would offer some ease-of-use advantages (as @mrthorstenm points out).
Related investigation: something else I’m looking at is better handling of heteronyms (eg row as in row-boat, rhyming with “no”, and row as in an angry row / argument, rhyming with “now”). Espeak-ng seems to manage some cases but it’s a little limited, so I’m trying out an idea using a KenLM model.
I train a KenLM language model on a load of example heteronym sentences in which the relevant form is given a suffix (eg |1 or |2). Then, when trying to predict the right pronunciation, I have the model tell me whether form |1 or form |2 is more likely for the input sentence and substitute the relevant phonemes accordingly. This is done with code that intercepts the inputs to server.py - it’s in a rather basic / hacky state right now but does basically work.
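To give a feel for the selection step, here’s a stripped-down sketch (not the actual code) - the model filename is a placeholder and the phoneme strings are only approximate illustrations; the Model/score calls are from kenlm’s Python bindings:

```python
import kenlm

# Placeholder: a KenLM model trained on sentences where heteronyms carry |1 / |2 suffixes.
MODEL_PATH = "heteronyms.arpa"

# Illustrative (approximate) espeak-ng phonemes for the two forms of "row".
PHONEMES = {
    "row|1": "r'oU",  # as in row-boat, rhymes with "no"
    "row|2": "r'aU",  # as in an argument, rhymes with "now"
}

model = kenlm.Model(MODEL_PATH)

def tag(sentence, word, suffix):
    # Naive whitespace tokenisation - good enough for a sketch.
    return " ".join(t + suffix if t.lower() == word else t for t in sentence.split())

def pick_form(sentence, word):
    # Score the sentence with each tagged form and keep whichever the LM prefers.
    scores = {s: model.score(tag(sentence, word, s), bos=True, eos=True) for s in ("|1", "|2")}
    return word + max(scores, key=scores.get)

form = pick_form("they had a blazing row about the washing up", "row")
print(form, "->", PHONEMES[form])
```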
The main advantage is that it works for heteronyms where the POS (eg noun/verb etc) doesn’t give away the pronunciation, eg “bases”, where both forms are nouns:
– plural of base (n.)
– plural of basis (n.)
And it avoids issues with POS parsing errors - whether it is overall more robust is something I’ll be testing in due course!