Contributing my German voice for TTS

Hi guys, just to let you know in case you are interested: feel free to check out my recent upload at https://github.com/repodiac/espeak-ng_german_loan_words - a brief tutorial with code for automatically creating an additional dictionary for espeak-ng with ~10k German loan words.

This may improve TTS preprocessing when using phonemes, because loan words are then pronounced correctly instead of being “German-ized”.
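For context, adding such a list to espeak-ng boils down to appending entries to the German dictsource list and recompiling the dictionary. A minimal sketch; the words and phoneme spellings below are illustrative placeholders, not taken from the linked repo:

```shell
# Illustrative sketch only: the entries below are placeholders,
# not from the repo. espeak-ng dictionary list entries are simple
# "<word> <pronunciation>" lines.
cat >> de_extra <<'EOF'
computer kOmpju:t@
server s3:v@
EOF
# Then, from espeak-ng's dictsource/ directory (with de_extra placed
# there), recompile the German dictionary:
# espeak-ng --compile=de
```

The repo automates generating such entry lines for the ~10k loan words instead of writing them by hand.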

1 Like

@erogol @mrthorstenm Many thanks for the computing and the donation, respectively. I have several questions:

  1. Below you see the performance on a PC with a quad-core i5 CPU with CUDA turned off; only two of the four cores seem to be used during computation:

    sentence = “In Deutschland starben bislang zwar weniger Menschen an Covid-19 als etwa in Belgien oder Großbritannien. Eine neue Studie zeigt jedoch: Bei Patienten, die ins Krankenhaus mussten, sind die Verläufe überall ähnlich.”
    align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

     (333312,)
      > Run-time: 36.56012582778931
      > Real-time factor: 2.4186008842661977
      > Time per step: 0.00010968712780813468
    

How can I improve this?

  2. Can this kind of setup run in realtime on a Jetson Nano?

  3. Remark: The word ‘mussten’ was pronounced as ‘müssten’ and ‘Verläufe’ as ‘Verlaufe’ :slight_smile:
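For reference, the reported real-time factor follows directly from the numbers above, assuming the model's usual 22050 Hz output sample rate (an assumption; check `audio.sample_rate` in your config):

```python
num_samples = 333312          # length of the generated waveform
sample_rate = 22050           # assumed output sample rate
run_time = 36.56012582778931  # wall-clock synthesis time in seconds

audio_seconds = num_samples / sample_rate
rtf = run_time / audio_seconds
print(f"audio: {audio_seconds:.2f}s, real-time factor: {rtf:.2f}")
# prints: audio: 15.12s, real-time factor: 2.42
```

A real-time factor above 1.0 means synthesis takes longer than the audio it produces, so this run is roughly 2.4x slower than realtime.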

1 Like
  1. To improve performance, turn on CUDA :wink:
  2. @mrthorstenm experimented with synthesis on an RPi3, but it is far from realtime. I ran an older Taco2 release on a Jetson Nano and the RT factor was 1:5 up to 1:10 (1 sec of audio required 5-10 sec of processing). In the meantime there is a TensorFlow version of Taco2; using it and converting the model to TFLite may improve performance on SBCs like the RPi and the Nano.
  3. This is a known issue; unfortunately, this model was trained with a wrong phoneme cleaner configuration. A new model is in the works (no publishing date yet).
3 Likes

Here’s the post @dkreutz mentioned.

1 Like

I missed your post, Dominik. Your numbers are not promising for using a Jetson Nano for TTS purposes as described here, at least not for realtime applications.

Hello.

It’s time for another short update.
We’re currently preparing a new training run and ran into some issues with phoneme handling for mixed English/German wording.

This warning occurs quite often:

[WARNING] found 1 utterances containing language switches on lines 1
[WARNING] extra phones may appear in the "de" phoneset
[WARNING] language switch flags have been kept (applying "keep-flags" policy)

We analyzed where these warnings come from and found that our dataset (metadata.csv) contains several (408) phrases with non-native German words that are common in everyday German.

Some examples:

  • server
  • opensource
  • song
  • chat
  • team
  • computer
  • party
  • cool

A few phoneme samples with the default config (keep-flags):

  • Auf der Couch könnte sie es sich gemütlich machen.
    • aʊf dɛɾ (en)kaʊtʃ(de) kœntə ziː ɛs zɪç ɡəmyːtlɪç maxən
  • Wie kann man den Song so verschandeln?
    • viː kan man deːn (en)sɒŋ(de) zoː fɛɾʃandəln
  • Nicht alle Teenager sind so.
    • nɪçt alə (en)tiːneɪdʒə(de) zɪnt zoː
  • Währenddessen spricht sie mit ihrem Computer.
    • vɛːrəndɛsən ʃpɾɪçt ziː mɪt iːrəm (en)kəmpjuːtə(de)

Currently we’re discussing whether we should run training with the default option “--language-switch keep-flags” (which produces these warnings) or with phoneme usage disabled in the config file.
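To illustrate what the alternative policy would do: stripping the switch flags is a simple text transformation. A minimal sketch (the helper name is ours, not from the phonemizer) that mimics a “remove-flags” policy on the samples above:

```python
import re

def strip_language_switch_flags(phonemes: str) -> str:
    """Remove (en)/(de)-style language-switch markers that the
    phonemizer emits under the default 'keep-flags' policy."""
    return re.sub(r"\([a-z]{2}\)", "", phonemes)

sample = "viː kan man deːn (en)sɒŋ(de) zoː fɛɾʃandəln"
print(strip_language_switch_flags(sample))
# → viː kan man deːn sɒŋ zoː fɛɾʃandəln
```

Either way the English phonemes themselves stay in the output; the policies only differ in whether the switch markers survive into the training text.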

Wishing you all a nice weekend :slight_smile:

3 Likes

Hi @erogol
It’s a funny coincidence that your latest dev commit (https://github.com/mozilla/TTS/commit/4f3917b9a673a4039e577a8098f545978df5ea2f) matches our current group internal discussion on “keep-flags” :slight_smile:
(See above post)

1 Like

For realtime we need to train MultiBand-MelGAN as the vocoder; then we can run it in realtime on CPU. MB-MelGAN + Taco2 reaches a real-time factor of 1.45 on a Raspi 3, and slightly faster if you use TFLite.

The PR dealing with the flags option has been merged.

1 Like

So, you defined “remove-flags” as the default. I initially thought keeping these flags would improve pronunciation for German sentences with English words in them. So will English words be pronounced “in English” without the flags?

eg.: “Wie kann man den Song so verschandeln?”
viː kan man deːn (en)sɒŋ(de) zoː fɛɾʃandəln

yes it’s supposed to be like that :slight_smile:

1 Like

So we’ll run the next training without the flags and check the pronunciation of English words in German sentences :wink: .

2 Likes

I forgot about the PWGAN training before my PTO, and now there is a model trained for 2,750,000 iterations. I’d assume it should be somewhat better than the previous model.
https://drive.google.com/drive/u/1/folders/1ks0pijycnX_JfpFwXoWJGrWrvml7ajGz

3 Likes

Better in which sense, @erogol? I do not see or hear any difference after replacing the vocoder model and configuration with the new ones, still using the German Colab.

Then maybe it is not better.

We’re preparing a MOS (mean opinion score) process to figure out which Taco2 + PWGAN checkpoint combination sounds most promising. You can get an impression of our work in progress by listening to the following examples.

If you want to help us, listen to the following two SoundCloud playlists:

  • Playlist “Ert”
  • Playlist “Bernie”

Which version do you find better, “Bernie” or “Ert”? Any further feedback is welcome as well.
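For anyone unfamiliar with MOS: it is simply the arithmetic mean of listener ratings on a 1-5 scale. A minimal sketch with made-up ratings (the numbers are hypothetical, not actual results from this test):

```python
import statistics

def mean_opinion_score(ratings):
    """MOS is the arithmetic mean of listener ratings (1 = bad, 5 = excellent)."""
    return statistics.mean(ratings)

# hypothetical ratings from a small listening test
bernie = [4, 4, 5, 3, 4]
ert = [3, 4, 4, 3, 3]
print(f"Bernie MOS: {mean_opinion_score(bernie):.2f}")  # 4.00
print(f"Ert MOS:    {mean_opinion_score(ert):.2f}")     # 3.40
```

With enough listeners per sample, the checkpoint combination with the higher MOS is the one to publish.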

3 Likes

@erogol Is it possible for you to post some other versions of Thorsten’s vocoder training between 925k and 2.75m? Maybe 1.5m and 2m, since the 2.75m version sounds worse; a step in between might even be better than the already good 925k one :slight_smile:

2 Likes

I am really impressed by the Mycroft teaser

https://soundcloud.com/thorsten-mueller-395984278/sets

and Gothic TTS

https://soundcloud.com/sanjaesc-395770686/sets

2 Likes

I’ve uploaded them: https://drive.google.com/drive/folders/1ks0pijycnX_JfpFwXoWJGrWrvml7ajGz?usp=sharing

Sorry for the late reply.

2 Likes

Thanks @erogol, the 1.5m checkpoint is already a lot worse than the 925k one. Would it be possible for you to upload the updated events file for TensorBoard? I would take a look; maybe we can guess which checkpoint between 925k and 1.5m might be best and get some hints for future trainings.