My approach is with espeak-ng - I’m afraid I haven’t used espeak for a while and had abandoned it for espeak-ng by the time I started getting more concerned about pronunciation. A similar approach may well work with espeak too, as they share many command line options (due to their common heritage).
When I talk of compiling the dictionary, it’s not the same as compiling the application - it’s just processing the dictionary file(s) into a format espeak-ng can use internally.
There’s a lot you can pick up from the excellent guide that Josh Meyer put together here: http://jrmeyer.github.io/tts/2016/07/03/How-to-Add-a-Language-to-eSpeak-NG.html (which I believe was then largely used as direct inspiration for some of the espeak-ng documentation)
Summary of steps:
- Find (if it’s already present) or download the source for espeak-ng
- Within espeak-ng/dictsource/ you’ll see the various language dictionary files
- For each language, there are likely to be four files:
– The emoji file - eg en_emoji - you can ignore this
– The rules file - eg en_rules - this sets the general pronunciation rules; for now you can ignore it
– The list of words (mainly ones which are exceptions) - eg en_list - this is interesting to look through for inspiration but it’s usually not necessary to change
– The extras file - eg en_extra - this is the one you’ll usually want to add to; it’s intended for “user”-specific words, ie domain-specific words or those you want to override
Because of the order it looks at the files, it will usually end up following the extras file, so you can mostly just edit this.
For further details on how it works you can look at the espeak-ng docs, but some examples are below. You just add entries as a list, one per line. In my case I’ve grouped them by a few basic types (abbreviations, names of people / places, general words), but that’s not necessary - there’s an excerpt showing this after the examples below.
Example of an abbreviation:
RAC $abbrev $allcaps
This results in it being said letter by letter (only if it’s all in capitals), so “R A C” (not “rack”)
Example of an abbreviation where you spell it out phonetically:
Airbnb 'e@b,i:;nb'i:
Example of a regular word:
boulevard b'u:lEv,A@d
Example of a word based on other words:
Coronavirus corona virus $text
When it sees “Coronavirus” it says it as “corona” followed by “virus” (otherwise it messes up the middle, merging the “avir” part as if they weren’t distinct words).
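Put together, the relevant part of an en_extra file would look something like this (the // lines are comments and the grouping is just my own convention for readability, not anything espeak-ng requires):

```
// abbreviations
RAC         $abbrev $allcaps

// names of people / places
Airbnb      'e@b,i:;nb'i:

// general words
boulevard   b'u:lEv,A@d
Coronavirus corona virus $text
```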
Process
You can iteratively add words to the extra file, compile the dictionary and test it in espeak-ng fairly rapidly if you compile just the language of interest.
Compiling on my laptop takes less than a second for English.
To compile, run this command (from within the dictsource folder):
sudo espeak-ng --compile=en
That yields output like this:
[sudo] password for <username>:
Using phonemetable: 'en'
Compiling: 'en_list'
5458 entries
Compiling: 'en_emoji'
1690 entries
Compiling: 'en_extra'
391 entries
Compiling: 'en_rules'
6743 rules, 103 groups (0)
To test it, I typically use espeak-ng directly, as you get an immediate idea of whether it has used the right phonemes, and then you can test it with TTS more fully once you are done. When I’m testing with RP English, I use this to get an interactive session:
espeak-ng -x -v gmw/en-GB-x-rp
Then you type something (the first line below), you’ll see the phonemes as output (the second line), and you’ll hear it speak the words aloud:
Coronavirus
k@r'oUna# v'aIr@s
If you want it to output IPA characters (which TTS gets via Phonemizer) then add --ipa, thus:
espeak-ng -x --ipa -v gmw/en-GB-x-rp
However, you need to take care not to get mixed up and paste IPA-formatted characters into the extras file - that file uses regular ASCII plus a few symbols that map to the IPA. With the same input as above, Coronavirus yields this:
Coronavirus
kəɹˈəʊnɐ vˈaɪɹəs
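If you end up iterating a lot, the compile-and-check loop is easy to script. Here’s a rough Python sketch of the sort of thing you could use - the dictsource path, voice and word list are placeholders, and the compile step may need sudo as noted above:

```python
import subprocess

# Placeholders - point these at your own espeak-ng checkout, voice and test words.
DICTSOURCE = "espeak-ng/dictsource"
VOICE = "gmw/en-GB-x-rp"
WORDS = ["Coronavirus", "Airbnb", "boulevard", "RAC"]

def compile_dictionary(lang="en"):
    # Recompile just the one language; may need elevated permissions (sudo)
    # if the espeak-ng data directory isn't writable by your user.
    subprocess.run(["espeak-ng", f"--compile={lang}"], cwd=DICTSOURCE, check=True)

def phonemes(word):
    # -q: don't speak aloud, -x: print the phoneme mnemonics to stdout.
    result = subprocess.run(
        ["espeak-ng", "-q", "-x", "-v", VOICE, word],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    compile_dictionary()
    for word in WORDS:
        print(f"{word}: {phonemes(word)}")
```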
Thoughts
I use this mainly for improved pronunciation at inference time, but it should obviously also help to provide the correct IPA for words in your training dataset, rather than giving slightly wrong IPA and then having the model need to learn that it should actually be said in a slightly unexpected way.
How necessary this is will depend on how well espeak-ng does with the words you commonly use. They’ve done a great job overall, and they necessarily have to compromise across the variations that exist within an accent. I found that whilst it was really quite good with RP English, it would get certain words consistently wrong in a way that sounded bad: eg plastic and drastic are short-a words even in RP, so the a is not the same as in class (which in RP has a long a). And it didn’t have a hope with obscure English or British place names!
By using phonemes you get the exact pronunciation desired and not some close approximation based on using regular letters alone.
The downside is that it’s clearly more involved, with a learning curve on the phoneme side. I suppose it could be automated, but a separate dictionary approach would offer some ease-of-use advantages (as @mrthorstenm points out).
Related investigation: something else I’m looking at is better handling of heteronyms (eg row as in row-boat, rhyming with “no”, and row as in an angry row / argument, rhyming with “now”). Espeak-ng seems to manage some cases but it’s a little limited, so I’m trying out an idea using a KenLM model.
I train a KenLM language model on a load of example heteronym sentences in which the relevant form is given a suffix (eg |1 or |2). Then, when trying to predict the right pronunciation, I have the model tell me whether form |1 or form |2 is more likely for the input sentence and substitute the relevant phonemes accordingly. This is done with code that intercepts the inputs to server.py - it’s in a rather basic / hacky state right now but does basically work.
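To give a feel for the selection step, here’s a stripped-down sketch (not the actual code) - the model filename is a placeholder and the phoneme strings are only approximate illustrations; the Model/score calls are from kenlm’s Python bindings:

```python
import kenlm

# Placeholder: a KenLM model trained on sentences where heteronyms carry |1 / |2 suffixes.
MODEL_PATH = "heteronyms.arpa"

# Illustrative (approximate) espeak-ng phonemes for the two forms of "row".
PHONEMES = {
    "row|1": "r'oU",  # as in row-boat, rhymes with "no"
    "row|2": "r'aU",  # as in an argument, rhymes with "now"
}

model = kenlm.Model(MODEL_PATH)

def tag(sentence, word, suffix):
    # Naive whitespace tokenisation - good enough for a sketch.
    return " ".join(t + suffix if t.lower() == word else t for t in sentence.split())

def pick_form(sentence, word):
    # Score the sentence with each tagged form and keep whichever the LM prefers.
    scores = {s: model.score(tag(sentence, word, s), bos=True, eos=True) for s in ("|1", "|2")}
    return word + max(scores, key=scores.get)

form = pick_form("they had a blazing row about the washing up", "row")
print(form, "->", PHONEMES[form])
```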
The main advantage is that it works for heteronyms where the POS (eg noun/verb etc) doesn’t give away the pronunciation, eg “bases”, where both forms are nouns:
– plural of base (n.)
– plural of basis (n.)
And it avoids issues with POS parsing errors - whether it is overall more robust is something I’ll be testing in due course!