Contributing my German voice for TTS

For real-time use, we need to train MultiBand-MelGAN as the vocoder; then we can run it in real time on a CPU. MB-MelGAN + Tacotron 2 reaches a real-time factor of 1.45 on a Raspberry Pi 3 and is slightly faster if you use TFLite.

The PR dealing with the flags option has been merged.

So, you defined “remove-flags” as the default. I initially thought keeping these flags would improve pronunciation for German sentences that contain English words. So will English words be pronounced “in English” without the flags?

E.g.: “Wie kann man den Song so verschandeln?”
viː kan man deːn (en)sɒŋ(de) zoː fɛɾʃandəln
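
To make the question concrete, here is a rough sketch of what removing the flags does to the phoneme string above. This is only a regex illustration, not the actual Mozilla TTS or phonemizer code; the function name `strip_language_flags` and the flag pattern are assumptions made for this example.

```python
import re

# Assumed flag format: a two-letter language code (optionally with a
# suffix such as "en-us") wrapped in parentheses, e.g. "(en)" or "(de)".
LANG_FLAG = re.compile(r"\([a-z]{2}(?:-[a-z0-9]+)*\)")


def strip_language_flags(phonemes: str) -> str:
    """Remove language-switch markers from a phoneme string."""
    return LANG_FLAG.sub("", phonemes)


with_flags = "viː kan man deːn (en)sɒŋ(de) zoː fɛɾʃandəln"
print(strip_language_flags(with_flags))
# prints: viː kan man deːn sɒŋ zoː fɛɾʃandəln
```

Note that the phonemes for “Song” stay the same either way; only the markers that tell the model about the language switch disappear.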

Yes, it’s supposed to be like that :slight_smile:

So we’ll run the next training without the flags and check the pronunciation of English words in German sentences :wink:.

I forgot about the PWGAN training before my PTO, and now there is a model trained for 2,750,000 iterations. I’d assume it should be somewhat better than the previous model:
https://drive.google.com/drive/u/1/folders/1ks0pijycnX_JfpFwXoWJGrWrvml7ajGz

Better in which sense, @erogol? I do not see or hear any difference after replacing the vocoder model and configuration with the new ones, still using the German Colab.

Then maybe it is not better.

We’re preparing a MOS (mean opinion score) process to figure out which Tacotron 2 + PWGAN checkpoint combination sounds most promising. You can get an impression of our work in progress by listening to the following examples.

If you want to help us, listen to the following two SoundCloud playlists:

Playlist “Ert”

Playlist “Bernie”

Which version do you find better, “Bernie” or “Ert”? Any further feedback is welcome as well.
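
For anyone wondering how such a MOS comparison is typically tallied, here is a minimal sketch. It assumes listeners rate each sample on a 1 to 5 scale; the function name `mos`, the dictionary keys, and the ratings are placeholders for illustration only, not real listening results.

```python
from math import sqrt
from statistics import mean, stdev


def mos(ratings):
    """Return (mean opinion score, approximate 95% confidence half-width)."""
    score = mean(ratings)
    if len(ratings) < 2:
        # A spread cannot be estimated from a single rating.
        return score, float("nan")
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return score, half_width


# Placeholder ratings on a 1-5 scale, NOT real listening results.
ratings_by_combo = {
    "combo_a": [4, 5, 3, 4, 4],
    "combo_b": [3, 4, 3, 2, 4],
}

for name, ratings in ratings_by_combo.items():
    score, half_width = mos(ratings)
    print(f"{name}: MOS {score:.2f} +/- {half_width:.2f}")
```

The confidence half-width uses a normal approximation, so with only a handful of listeners it should be read as a rough indication rather than a firm bound.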

@erogol Is it possible for you to post some other versions of Thorsten’s vocoder training between 925k and 2.75m? Maybe 1.5m and 2m as the 2.75m version sounds worse, but maybe a step in between is even better than the already good 925k one :slight_smile:

I am really impressed by the Mycroft teaser

https://soundcloud.com/thorsten-mueller-395984278/sets

and Gothic TTS

https://soundcloud.com/sanjaesc-395770686/sets

I’ve uploaded them: https://drive.google.com/drive/folders/1ks0pijycnX_JfpFwXoWJGrWrvml7ajGz?usp=sharing

Sorry for the late reply.

Thanks @erogol, the 1.5m checkpoint is already a lot worse than the 925k one. Would it be possible for you to upload the updated events file for TensorBoard? I would take a look, and maybe we can guess which checkpoint between 925k and 1.5m is best and get some hints for future training runs.

Following the evolution of the dataset (https://github.com/thorstenMueller/deep-learning-german-tts/blob/master/EvolutionOfThorstenDataset.pdf), I published a CSV file listing which recordings were made in which phase (different quality levels).
So if anyone wants to use only a specific phase, you can now look up which files belong to it.

See (chapter dataset evolution) here:

@dkreutz I turned on CUDA :slight_smile:

And I did some further tests using Mozilla TTS and various other TTS repos, the latter on an English dataset:

  1. Retesting the performance of Mozilla TTS on a PC with a quad-core i5 CPU and CUDA off. I was not aware that running inference again and again on different text inputs makes the RTF converge:

    sentence = "Nach dem Zusammenbruch des Römischen Reiches drangen Alamannen und Burgunden in das Gebiet ein."
    align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)
    (134144,)

    Run-time: 12.397961854934692
    Real-time factor: 2.037906188176561
    Time per step: 9.242207778774145e-05

  2. Testing Mozilla TTS on a Jetson Nano with CUDA on:

    sentence = "Nach dem Zusammenbruch des Römischen Reiches drangen Alamannen und Burgunden in das Gebiet ein."

    align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

    (134144,)

    Run-time: 8.102147102355957
    Real-time factor: 1.33133003504179
    Time per step: 6.037827702025876e-05
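
For readers who want to reproduce these numbers, the real-time factor is the synthesis run-time divided by the duration of the generated audio, so values above 1 mean slower than real time. Below is a small sketch of that arithmetic. The helper names are made up for this example, and the 22050 Hz sample rate is an assumption, since the posts above do not state it.

```python
def real_time_factor(run_time_s: float, num_samples: int, sample_rate: int) -> float:
    """Run-time divided by audio duration; below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return run_time_s / audio_seconds


def time_per_sample(run_time_s: float, num_samples: int) -> float:
    """Average run-time spent per generated audio sample."""
    return run_time_s / num_samples


SAMPLE_RATE = 22050  # assumption: not stated in the thread
NUM_SAMPLES = 134144  # length of the generated waveform in both runs

runs = [
    ("i5 CPU, CUDA off", 12.397961854934692),
    ("Jetson Nano, CUDA on", 8.102147102355957),
]

for label, run_time in runs:
    rtf = real_time_factor(run_time, NUM_SAMPLES, SAMPLE_RATE)
    step = time_per_sample(run_time, NUM_SAMPLES)
    print(f"{label}: RTF {rtf:.2f}, time per sample {step:.2e} s")
```

Under that sample-rate assumption, the results land close to the reported values of about 2.04 and 1.33; small deviations are expected because the exact timing boundaries in the notebook are not known here.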

So far I get the best results using FastSpeech(2) + a MelGAN vocoder, which runs at about real time on the Jetson Nano. On CPU there are some TTS models + MelGAN that provide an RTF of about 0.25, but unfortunately they do not perform well on ARM + GPUs.

Looking forward to @erogol, @mrthorstenm and maybe @sanjaesc pushing the German TTS topic.

It is really interesting to see the comparison. Could you please clarify which models you used specifically? Is it Tacotron2 + MelGAN?

You could try the Glow-TTS model, which is English-only at the moment. However, it would show you what is possible with that model in terms of run-time. It ought to be faster than real time.

Yes, I used Tacotron2 DDC + MelGAN for the comparison.

I will try out Glow TTS and report the outcome. Thank you for the hint @erogol.

I have issues running Glow TTS on my Jetson Nano because of this error:

MAGMA library not found in compilation. Please rebuild with MAGMA.

On my PC the run-time is quite good, with an RTF of about 0.25 without CUDA, but the audio quality produced by this model is poor in my opinion.

p.s. :s/pure/poor

You might want to look here: Nvidia-MAGMA and ICL/UTK-website

Is that an autocorrect/typo for “poor” where it says “pure”?
