Contributing my german voice for tts

Before I start Tacotron2 training on the complete dataset another question to the community regarding vocoders: besides the default Griffin-Lim (from librosa?) I see there are some more options available that may yield even better quality like WaveGlow, WaveRNN, MelGAN etc.

Which of the vocoder options would you choose today (April 2020)?
Should I consider anything for this when configuring Taco2, e.g. certain parameters, or check out a certain branch/tag?

1 Like

WaveRNN gives the best quality; the other GAN-based models are aimed at real-time inference. You don’t need to set any parameters specifically.

2 Likes

@mrthorstenm: I now checked out espeak-ng and its dictionary file for german:

Question: Do you guys plan on using this sort of file or what pronunciation patterns are you working with otherwise?

So, just to align with you guys, and assuming you focus on espeak-ng patterns, I would suggest proceeding as follows:

1 - the file currently holds around 1000 words in total, so I would try to enlarge the word list significantly
2 - I would try to find some basic heuristics where automated creation of pronunciation patterns works out of the box OR comes close to being correct and needs only little correction by humans
3 - I would do this in a loop: extend the dictionary, find heuristics, test, extend again, check heuristics, etc.
4 - if this does not work, OR in addition (“and-or”), I could write a simple web app where - similar to Common Voice - an espeak-ng sample is played out loud in the browser (this shouldn’t be difficult since espeak-ng can create WAV files) together with the current “educated guess” of the pronunciation pattern, which the user then corrects if necessary. Such an app could be made public or simply distributed to whoever likes to help extend the dictionary.
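As a rough sketch of point 4: the espeak-ng flags below are real command-line options, but the helper functions are made up for illustration and synthesizing audio of course requires espeak-ng to be installed.

```python
import subprocess

def espeak_cmd(word, voice="de", wav_path=None):
    """Build an espeak-ng command line (illustrative helper, not a real API).

    With wav_path: synthesize the word to a WAV file (-w).
    Without: print phoneme mnemonics (-x) quietly (-q) instead of speaking.
    """
    cmd = ["espeak-ng", "-v", voice]
    cmd += ["-w", wav_path] if wav_path else ["-q", "-x"]
    return cmd + [word]

def phonemize(word, voice="de"):
    """Return espeak-ng's phoneme guess for a word, to be shown for correction."""
    out = subprocess.run(
        espeak_cmd(word, voice), capture_output=True, text=True, check=True
    )
    return out.stdout.strip()
```

The web app would then play the WAV, display the `phonemize` output as the educated guess, and store the user’s correction for the dictionary.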

What do you think?

1 Like

Yes, such a pronunciation dictionary is the way to go.

I am no expert in linguistics so I have no idea if heuristics would work.

Sounds like a plan.

For Mycroft’s Mimic2-TTS there is already a web application available: https://mimic.mycroft.ai/pronounce
Maybe we can re-use that…

Great - if everyone agrees…
Thanks for the pointer, I know Mycroft but did not check out their app yet. Will do asap.

Btw: do you guys use Slack or similar for development? Discourse is nice (but CUMBERSOME regarding logging in…) but not really suitable for joint work, I think.

PS: … and I am curious how to reply to (and beneath) single excerpts, like you did with my previous post @dkreutz :slight_smile: ?

Ok, I checked out Mycroft’s Mimic Pronounce, as they call it - to me it looks like this is NOT open source!? I could not find the source code, either in their GitHub repo or on the website. The “Recording Studio” is something else.

Does anyone else know more about this?

I just asked about the Pronounce-webapp in the Mimic channel of chat.mycroft.ai - repo is currently private.

@mrthorstenm and I are currently chatting on a private channel of the mycroft-chat.

Just highlight/mark the text and a little “quote” button should appear; click on it and the highlighted text will be inserted into the reply textbox.

GREAT SCOTT! it works :smiley:

Besides… would you like to switch to a more open chat so that more people can get involved actively?

1 Like

Thanks @repodiac .
The phonemes mapping seems to be different in espeak-ng and mimic-pronounce.

espeak-ng
Montenegro —> mOnt@n’e:gro:

mimic-pronounce
Bremerhaven —> Brreamer-haw–fa-n

Should we use espeak-ng as the “leading syntax” for this?
All collected mappings should finally be merged into espeak to be available to all.

Hi again - of course I haven’t followed all your discussions so far, so please be so kind as to set me straight here if needed, thanks :slight_smile:

  • What is the point in using Mimic Pronounce if you cannot use it for your own data (or can you?)? It is not even adaptable yet (because it is “closed-source”).
  • So, should I then follow my “plan” to extend the de_list from espeak-ng as described above? The web app wouldn’t be top priority on my list for now, but it is a possible milestone if Mimic Pronounce is not available (or does anyone know of other tools?).

As both of us - @mrthorstenm and I - are not on Slack and you seem to be interested in Mycroft anyway: would you mind joining us on the Mycroft-chat channel ~language-de?

Alright, just logged in…

1 Like

I followed @erogol 's appeal here.

First tacotron1 sample (training step 100k) is available here:
Soundcloud Link

I know many more training steps are required, but I told you I’d keep this thread up to date :wink:

1 Like

Hello @mrthorstenm, I’ve been following your work for a while, a big thanks for your efforts! unfortunately the SoundCloud link seems to be broken though. Also, can we hope for a new release of your raw data any time soon? I did some promising experiments with your January data and would like to repeat them with the current version.
Edit: Soundcloud link works now, perhaps just a temporary outage.

Thanks for your nice words :-).
First I want to wait for the training to finish and then release the final Tacotron 1 and 2 models.

Just want to double check as I found this recommendation to enable trim_silence - do you think that is necessary for good WaveRNN results?

If you trim silences, that might reduce the model’s performance on silences and pauses between words.
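For reference, trimming usually removes only leading/trailing silence, so internal pauses are kept either way. A simplified numpy stand-in for such a trim function (the function name and the amplitude threshold are assumptions for illustration, not the actual Mozilla TTS implementation):

```python
import numpy as np

def trim_edges(wav, threshold=0.01):
    """Drop leading/trailing samples whose absolute amplitude is below threshold.

    wav: 1-D float array of audio samples. Internal pauses are kept intact;
    only the silent head and tail of the clip are removed.
    """
    loud = np.flatnonzero(np.abs(wav) >= threshold)
    if loud.size == 0:
        return wav[:0]  # whole clip is below threshold
    return wav[loud[0] : loud[-1] + 1]
```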

Short update.

We (@dkreutz, @repodiac, @baconator, gras64 (from the Mycroft community) and I) are currently trying different parameter settings for Mozilla-TTS-based Tacotron 1 and 2 training. Nothing ready to show right now, but we’re on the way.

In parallel, @repodiac is trying to improve German pronunciation with kind support and know-how from @nmstoker.

Thanks for all your amazing support. Hopefully we can provide a free-to-use German voice model of acceptable quality.

4 Likes

Could anybody help us interpret the results of the dataset analysis from the CheckDatasetSNR/AnalyzeDataset notebooks?

[Graphs: text length vs STD, text length vs median audio duration, text length vs mean audio duration, text length vs instances]

I’m unable to interpret the graph produced by the SNR notebook.

no. of wav files: 20710
average SNR of the dataset: 30.875177400417556

[SNR distribution graph]

If it’s helpful, I can provide sample WAVs marked as good and bad. Maybe someone can explain why each file is classified the way it is.
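For intuition only, here is a crude energy-based SNR estimate that treats the quietest frames of a clip as noise. This is not necessarily the notebook’s method (which may use a different estimator such as WADA-SNR); the function name, frame length, and quantile are illustrative assumptions.

```python
import numpy as np

def estimate_snr_db(samples, frame_len=2048, noise_quantile=0.1):
    """Crude SNR estimate: quietest frames approximate the noise floor.

    samples: 1-D float array of audio samples in [-1, 1].
    Returns the ratio of mean frame power to the low-quantile
    frame power, in decibels.
    """
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.mean(frames ** 2, axis=1)                 # mean power per frame
    noise_power = np.quantile(energies, noise_quantile) + 1e-12
    signal_power = np.mean(energies) + 1e-12
    return 10.0 * np.log10(signal_power / noise_power)
```

On a clean studio recording the quiet frames are nearly silent, so the ratio (and the dB value) comes out high; an average around 30 dB as reported above is generally considered decent for TTS data.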

Any recommendations on what to optimize, or which files to remove (based on audio length), before running training?

Might it be a problem having a dataset with 20k phrases but only 264 phrases longer than 100 chars?
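If removing outliers by text length turns out to be useful, a hedged sketch of such a filter over an LJSpeech-style, pipe-separated metadata.csv (the file layout is assumed; the function name is made up) might look like:

```python
import csv

def filter_by_text_length(metadata_path, out_path, min_chars=10, max_chars=180):
    """Keep only metadata rows whose text length falls in [min_chars, max_chars].

    Assumes LJSpeech-style rows: wav_id|raw text|normalized text.
    Returns (number of rows kept, number of rows dropped).
    """
    kept, dropped = [], 0
    with open(metadata_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            text = row[-1]  # use the normalized-text column
            if min_chars <= len(text) <= max_chars:
                kept.append(row)
            else:
                dropped += 1
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="|").writerows(kept)
    return len(kept), dropped
```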

1 Like

Here’s another regular update on our progress.

@dkreutz and @baconator are still trying out different training configurations, while @repodiac is becoming an expert on German phonemes and espeak(-ng), providing great support :smile:
Since I am dissatisfied with the “text length vs STD” graphic from the “AnalyzeDataset” notebook, I returned to the microphone to record more phrases (around 3k) with a character length between 100 and 180. This will take some weeks.

Thanks so far, and stay tuned for further updates.

5 Likes