Contributing my German voice for TTS

Hi guys, just to let you know in case you are interested: feel free to check out my recent upload at https://github.com/repodiac/espeak-ng_german_loan_words - a brief tutorial with code for automatically creating an additional dictionary for espeak-ng with ~10k German loan words.

This may improve TTS preprocessing when using phonemes, because loan words are then pronounced correctly instead of being “German-ized”.
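For context, adding such a list to espeak-ng boils down to appending entries to the German dictsource list and recompiling the dictionary. A minimal sketch; the words and phoneme spellings below are illustrative placeholders, not taken from the linked repo:

```shell
# Illustrative sketch only: the entries below are placeholders,
# not from the repo. espeak-ng dictionary list entries are simple
# "<word> <pronunciation>" lines.
cat >> de_extra <<'EOF'
computer kOmpju:t@
server s3:v@
EOF
# Then, from espeak-ng's dictsource/ directory (with de_extra placed
# there), recompile the German dictionary:
# espeak-ng --compile=de
```

The repo automates generating such entry lines for the ~10k loan words instead of writing them by hand.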

1 Like

@erogol @mrthorstenm Many thanks for the computing and the donation, respectively. I have several questions:

  1. Below you see the performance on a PC with a quad-core i5 CPU with CUDA turned off; only two of the four cores seem to be used during computation:

    sentence = “In Deutschland starben bislang zwar weniger Menschen an Covid-19 als etwa in Belgien oder Großbritannien. Eine neue Studie zeigt jedoch: Bei Patienten, die ins Krankenhaus mussten, sind die Verläufe überall ähnlich.”
    align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

     (333312,)
      > Run-time: 36.56012582778931
      > Real-time factor: 2.4186008842661977
      > Time per step: 0.00010968712780813468
    

How can I improve this?

  2. Can this kind of setup run in realtime on a Jetson Nano?

  3. Remark: The word ‘mussten’ was pronounced as ‘müssten’ and ‘Verläufe’ as ‘Verlaufe’ :slight_smile:
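For reference, the reported real-time factor follows directly from the numbers above, assuming the model's usual 22050 Hz output sample rate (an assumption; check `audio.sample_rate` in your config):

```python
num_samples = 333312          # length of the generated waveform
sample_rate = 22050           # assumed output sample rate
run_time = 36.56012582778931  # wall-clock synthesis time in seconds

audio_seconds = num_samples / sample_rate
rtf = run_time / audio_seconds
print(f"audio: {audio_seconds:.2f}s, real-time factor: {rtf:.2f}")
# prints: audio: 15.12s, real-time factor: 2.42
```

A real-time factor above 1.0 means synthesis takes longer than the audio it produces, so this run is roughly 2.4x slower than realtime.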

1 Like
  1. To improve performance, turn on CUDA :wink:
  2. @mrthorstenm experimented with synthesis on an RPi3, but it is far from realtime. I ran an older Taco2 release on a Jetson Nano and the RT factor was 1:5 up to 1:10 (1 sec of audio required 5-10 sec of processing). In the meantime there is a TensorFlow version of Taco2; using it and converting the model to TFLite may improve performance on SBCs like the RPi and the Nano.
  3. This is a known issue; unfortunately, this model was trained with a wrong phoneme cleaner configuration. A new model is in the works (no publishing date yet).
3 Likes

Here’s the post @dkreutz mentioned.

1 Like

I missed your post, Dominik. Your numbers are not promising for using a Jetson Nano for TTS purposes as described here, at least not for realtime applications.

Hello.

It’s time for another short update.
We’re currently preparing a new training run and ran into some issues with phoneme handling for mixed English/German wording.

This warning occurs quite often:

[WARNING] found 1 utterances containing language switches on lines 1
[WARNING] extra phones may appear in the "de" phoneset
[WARNING] language switch flags have been kept (applying "keep-flags" policy)

We analyzed where these warnings come from and found that our dataset (metadata.csv) contains several (408) phrases with non-native German words that are common in everyday German.

Some examples:

  • server
  • opensource
  • song
  • chat
  • team
  • computer
  • party
  • cool

A few phoneme samples with the default config (keep-flags):

  • Auf der Couch könnte sie es sich gemütlich machen.
    • aʊf dɛɾ (en)kaʊtʃ(de) kœntə ziː ɛs zɪç ɡəmyːtlɪç maxən
  • Wie kann man den Song so verschandeln?
    • viː kan man deːn (en)sɒŋ(de) zoː fɛɾʃandəln
  • Nicht alle Teenager sind so.
    • nɪçt alə (en)tiːneɪdʒə(de) zɪnt zoː
  • Währenddessen spricht sie mit ihrem Computer.
    • vɛːrəndɛsən ʃpɾɪçt ziː mɪt iːrəm (en)kəmpjuːtə(de)

Currently we’re discussing whether we should run training with the default option “--language-switch keep-flags” (which produces these warnings) or with phoneme usage disabled in the config file.
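To illustrate what the alternative policy would do: stripping the switch flags is a simple text transformation. A minimal sketch (the helper name is ours, not from the phonemizer) that mimics a “remove-flags” policy on the samples above:

```python
import re

def strip_language_switch_flags(phonemes: str) -> str:
    """Remove (en)/(de)-style language-switch markers that the
    phonemizer emits under the default 'keep-flags' policy."""
    return re.sub(r"\([a-z]{2}\)", "", phonemes)

sample = "viː kan man deːn (en)sɒŋ(de) zoː fɛɾʃandəln"
print(strip_language_switch_flags(sample))
# → viː kan man deːn sɒŋ zoː fɛɾʃandəln
```

Either way the English phonemes themselves stay in the output; the policies only differ in whether the switch markers survive into the training text.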

Wishing you all a nice weekend :slight_smile:

3 Likes

Hi @erogol
It’s a funny coincidence that your latest dev commit (https://github.com/mozilla/TTS/commit/4f3917b9a673a4039e577a8098f545978df5ea2f) matches our current group internal discussion on “keep-flags” :slight_smile:
(See above post)

1 Like

For realtime we need to train MultiBand-MelGAN as the vocoder; then we can run it in realtime on CPU. MB-MelGAN + Taco2 reaches a real-time factor of 1.45 on a Raspi 3, and slightly faster if you use TFLite.

The PR dealing with the flags option has been merged.

1 Like

So, you defined “remove-flags” as the default. I initially thought keeping these flags would improve pronunciation for German sentences with English words in them. So will English words be pronounced “in English” without the flags?

eg.: “Wie kann man den Song so verschandeln?”
viː kan man deːn (en)sɒŋ(de) zoː fɛɾʃandəln

yes it’s supposed to be like that :slight_smile:

1 Like

So we’ll run the next training without the flags and check the pronunciation of English words in German sentences :wink: .

2 Likes

I forgot about the PWGAN training before my PTO, and now there is a model trained for 2,750,000 iterations. I’d assume it should be somewhat better than the previous model.
https://drive.google.com/drive/u/1/folders/1ks0pijycnX_JfpFwXoWJGrWrvml7ajGz

3 Likes

Better in which sense, @erogol? I do not see or hear any difference after replacing the vocoder model and configuration with the new ones, still using the German Colab.

Then maybe it is not better.

We’re preparing a MOS (mean opinion score) process to figure out which Taco2 + PWGAN checkpoint combination sounds most promising. You can get an impression of our work in progress by listening to the following examples.

If you want to help us, listen to the following two SoundCloud playlists:

  • Playlist “Ert”
  • Playlist “Bernie”

Which version do you find better, “Bernie” or “Ert”? Any further feedback is welcome as well.
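For anyone unfamiliar with MOS: it is simply the arithmetic mean of listener ratings on a 1-5 scale. A minimal sketch with made-up ratings (the numbers are hypothetical, not actual results from this test):

```python
import statistics

def mean_opinion_score(ratings):
    """MOS is the arithmetic mean of listener ratings (1 = bad, 5 = excellent)."""
    return statistics.mean(ratings)

# hypothetical ratings from a small listening test
bernie = [4, 4, 5, 3, 4]
ert = [3, 4, 4, 3, 3]
print(f"Bernie MOS: {mean_opinion_score(bernie):.2f}")  # 4.00
print(f"Ert MOS:    {mean_opinion_score(ert):.2f}")     # 3.40
```

With enough listeners per sample, the checkpoint combination with the higher MOS is the one to publish.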

3 Likes

@erogol Is it possible for you to post some other versions of Thorsten’s vocoder training between 925k and 2.75m? Maybe 1.5m and 2m, since the 2.75m version sounds worse; a step in between might even be better than the already good 925k one :slight_smile:

2 Likes

I am really impressed by the Mycroft teaser

https://soundcloud.com/thorsten-mueller-395984278/sets

and Gothic TTS

https://soundcloud.com/sanjaesc-395770686/sets

2 Likes

I’ve uploaded them: https://drive.google.com/drive/folders/1ks0pijycnX_JfpFwXoWJGrWrvml7ajGz?usp=sharing

Sorry for the late reply.

2 Likes

Thanks @erogol, the 1.5m checkpoint is already a lot worse than the 925k one. Would it be possible for you to upload the updated events file for TensorBoard? I would take a look; maybe we can guess which checkpoint between 925k and 1.5m might be best and get some hints for future trainings.