Contributing my german voice for tts

Before I start Tacotron2 training on the complete dataset another question to the community regarding vocoders: besides the default Griffin-Lim (from librosa?) I see there are some more options available that may yield even better quality like WaveGlow, WaveRNN, MelGAN etc.

Which of the vocoder options would you choose today (April 2020)?
Should I consider anything for this when configuring Taco2, e.g. certain parameters, or check out a certain branch/tag?

1 Like

WaveRNN gives the best quality; the other GAN-based models are aimed at real-time inference. You don’t need to set any parameters specifically.

2 Likes

@mrthorstenm: I now checked out espeak-ng and its dictionary file for german:

Question: Do you guys plan on using this sort of file or what pronunciation patterns are you working with otherwise?

So, just to align with you guys, and assuming you focus on espeak-ng patterns, I would suggest proceeding as follows:

1 - the file currently holds around 1000 words in total, so I would try to enlarge the word list significantly
2 - I would try to find some basic heuristics where automated creation of pronunciation patterns works out of the box OR comes close to being correct and needs only little correction by humans
3 - I would do this in a loop: extend the dictionary, find heuristics, test, extend again, check heuristics, etc.
4 - if this does not work, OR in addition (“and-or”), I could write a simple web app where - similar to Common Voice - an espeak-ng sample is played out loud in the browser (this shouldn’t be difficult since espeak-ng can create WAV files) together with the current “educated guess” of the pronunciation pattern, which the user then corrects if necessary. Such an app could be made public or simply distributed to whoever likes to help extend the dictionary.
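As a rough sketch of point 4: the espeak-ng flags below are real command-line options, but the helper functions are made up for illustration and synthesizing audio of course requires espeak-ng to be installed.

```python
import subprocess

def espeak_cmd(word, voice="de", wav_path=None):
    """Build an espeak-ng command line (illustrative helper, not a real API).

    With wav_path: synthesize the word to a WAV file (-w).
    Without: print phoneme mnemonics (-x) quietly (-q) instead of speaking.
    """
    cmd = ["espeak-ng", "-v", voice]
    cmd += ["-w", wav_path] if wav_path else ["-q", "-x"]
    return cmd + [word]

def phonemize(word, voice="de"):
    """Return espeak-ng's phoneme guess for a word, to be shown for correction."""
    out = subprocess.run(
        espeak_cmd(word, voice), capture_output=True, text=True, check=True
    )
    return out.stdout.strip()
```

The web app would then play the WAV, display the `phonemize` output as the educated guess, and store the user’s correction for the dictionary.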

What do you think?

1 Like

Yes, such a pronunciation dictionary is the way to go.

I am no expert in linguistics so I have no idea if heuristics would work.

Sounds like a plan.

For Mycroft’s Mimic2-TTS there is already a web application available: https://mimic.mycroft.ai/pronounce
Maybe we can re-use that…

Great - if everyone agrees…
Thanks for the pointer, I know Mycroft but did not check out their app yet. Will do asap.

Btw: do you guys use Slack or similar for development? Discourse is nice (but CUMBERSOME regarding logging in…) but not really suitable for joint work, I think.

PS: … and I am curious how to reply to (and beneath) single excerpts, like you did with my previous post @dkreutz :slight_smile: ?

Ok, I checked out Mycroft’s Mimic Pronounce, as they call it - to me it looks like this is NOT open source!? I could not find the source code, either in their GitHub repo or on the website. The “Recording Studio” is something else.

Does anyone else know more about this?

I just asked about the Pronounce-webapp in the Mimic channel of chat.mycroft.ai - repo is currently private.

@mrthorstenm and I are currently chatting on a private channel of the mycroft-chat.

Just highlight/mark the text and a little “quote” button should appear; click on it and the highlighted text will be inserted into the reply textbox.

GREAT SCOTT! it works :smiley:

Besides… would you like to switch to a more open chat so that more people can get involved actively?

1 Like

Thanks @repodiac .
The phonemes mapping seems to be different in espeak-ng and mimic-pronounce.

espeak-ng
Montenegro —> mOnt@n’e:gro:

mimic-pronounce
Bremerhaven —> Brreamer-haw–fa-n

Should we use espeak-ng as the “leading syntax” for this?
All collected mappings should finally be merged into espeak to be available to all.

Hi again - of course I haven’t followed all your discussions so far, so please be so kind as to set me straight here if needed, thanks :slight_smile:

  • What is the point in using Mimic Pronounce if you cannot use it for your own data (or can you?)? It is not even adaptable yet (because it is “closed-source”).
  • So, should I then follow my “plan” to extend the de_list from espeak-ng as described above? The web app wouldn’t be top priority on my list for now, but it is a possible milestone if Mimic Pronounce is not available (or does anyone know of other tools?).

As both of us - @mrthorstenm and I - are not on Slack and you seem to be interested in Mycroft anyway: would you mind joining us on the Mycroft-chat channel ~language-de?

Alright, just logged in…

1 Like

I followed @erogol 's appeal here.

First tacotron1 sample (training step 100k) is available here:
Soundcloud Link

I know many more training steps are required, but I told you I’d keep this thread up to date :wink:

1 Like

Hello @mrthorstenm, I’ve been following your work for a while, a big thanks for your efforts! unfortunately the SoundCloud link seems to be broken though. Also, can we hope for a new release of your raw data any time soon? I did some promising experiments with your January data and would like to repeat them with the current version.
Edit: Soundcloud link works now, perhaps just a temporary outage.

Thanks for your nice words :-).
First I want to wait for the training to finish and then release the final Tacotron 1 and 2 models.

Just want to double check as I found this recommendation to enable trim_silence - do you think that is necessary for good WaveRNN results?

If you trim silences, that might reduce the model’s performance on silences and pauses between words.
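For reference, trimming usually removes only leading/trailing silence, so internal pauses are kept either way. A simplified numpy stand-in for such a trim function (the function name and the amplitude threshold are assumptions for illustration, not the actual Mozilla TTS implementation):

```python
import numpy as np

def trim_edges(wav, threshold=0.01):
    """Drop leading/trailing samples whose absolute amplitude is below threshold.

    wav: 1-D float array of audio samples. Internal pauses are kept intact;
    only the silent head and tail of the clip are removed.
    """
    loud = np.flatnonzero(np.abs(wav) >= threshold)
    if loud.size == 0:
        return wav[:0]  # whole clip is below threshold
    return wav[loud[0] : loud[-1] + 1]
```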

Short update.

We (@dkreutz, @repodiac, @baconator, gras64 (from the Mycroft community) and I) are currently trying different parameter settings for Mozilla-TTS-based Tacotron 1 and 2 training. Nothing ready to show right now, but we’re on the way.

In parallel, @repodiac is trying to improve German pronunciation with kind support and know-how from @nmstoker.

Thanks for all your amazing support. Hopefully we can provide a free-to-use German voice model of acceptable quality.

4 Likes

Could anybody help us interpret the results of the dataset analysis from the CheckDatasetSNR/AnalyzeDataset notebooks?

[Graphs: text length vs STD, text length vs median audio duration, text length vs mean audio duration, text length vs instances]

I’m unable to interpret the graph produced by the SNR notebook.

no. of wav files: 20710
average SNR of the dataset: 30.875177400417556

[SNR distribution graph]

If it’s helpful, I can provide sample WAVs marked as good and bad. Maybe someone can explain why each file is classified the way it is.
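For intuition only, here is a crude energy-based SNR estimate that treats the quietest frames of a clip as noise. This is not necessarily the notebook’s method (which may use a different estimator such as WADA-SNR); the function name, frame length, and quantile are illustrative assumptions.

```python
import numpy as np

def estimate_snr_db(samples, frame_len=2048, noise_quantile=0.1):
    """Crude SNR estimate: quietest frames approximate the noise floor.

    samples: 1-D float array of audio samples in [-1, 1].
    Returns the ratio of mean frame power to the low-quantile
    frame power, in decibels.
    """
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.mean(frames ** 2, axis=1)                 # mean power per frame
    noise_power = np.quantile(energies, noise_quantile) + 1e-12
    signal_power = np.mean(energies) + 1e-12
    return 10.0 * np.log10(signal_power / noise_power)
```

On a clean studio recording the quiet frames are nearly silent, so the ratio (and the dB value) comes out high; an average around 30 dB as reported above is generally considered decent for TTS data.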

Any recommendations on what to optimize, or which files to remove (based on audio length), before running training?

Might it be a problem having a dataset with 20k phrases but only 264 phrases longer than 100 chars?
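If removing outliers by text length turns out to be useful, a hedged sketch of such a filter over an LJSpeech-style, pipe-separated metadata.csv (the file layout is assumed; the function name is made up) might look like:

```python
import csv

def filter_by_text_length(metadata_path, out_path, min_chars=10, max_chars=180):
    """Keep only metadata rows whose text length falls in [min_chars, max_chars].

    Assumes LJSpeech-style rows: wav_id|raw text|normalized text.
    Returns (number of rows kept, number of rows dropped).
    """
    kept, dropped = [], 0
    with open(metadata_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            text = row[-1]  # use the normalized-text column
            if min_chars <= len(text) <= max_chars:
                kept.append(row)
            else:
                dropped += 1
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="|").writerows(kept)
    return len(kept), dropped
```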

1 Like

Here’s another regular update on our progress.

@dkreutz and @baconator are still trying out different training configurations, while @repodiac is becoming an expert on German phonemes and espeak(-ng), providing great support :smile:
Since I am dissatisfied with the “text length vs STD” graphic from the “AnalyzeDataset” notebook, I returned to the microphone to record more phrases (around 3k) with a character length between 100 and 180. This will take some weeks.

Thanks so far, and stay tuned for further updates.

5 Likes