Contributing my German voice for TTS

I forgot that the PWGAN training was still running before my PTO, and now there is a model trained for 2,750,000 iterations. I’d assume that should somehow be better than the previous model:
https://drive.google.com/drive/u/1/folders/1ks0pijycnX_JfpFwXoWJGrWrvml7ajGz

Better in which sense, @erogol? I do not see or hear any difference when replacing the vocoder model and configuration with the new ones, still using the German Colab.

Then maybe it is not better.

We’re preparing a MOS (mean opinion score) process to figure out which Taco2 + PWGAN checkpoint combinations sound most promising. You can get an impression of our work in progress by listening to the following examples.
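For context, a MOS is just the arithmetic mean of listener opinion scores (typically on a 1–5 scale) per system. A minimal sketch with made-up ratings — the numbers and vote counts below are purely illustrative, not from the actual process:

```python
from statistics import mean

# Hypothetical listener ratings (1 = bad, 5 = excellent) per checkpoint
# combination; a real MOS study would collect many more votes.
ratings = {
    "Ert":    [4, 3, 4, 5, 4],
    "Bernie": [3, 4, 3, 3, 4],
}

# The mean opinion score is simply the average rating per system.
mos = {name: mean(scores) for name, scores in ratings.items()}
print(mos)
```

The system with the higher MOS would then be the candidate to ship.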

If you want to help us, listen to the following two SoundCloud playlists:

Ert

Playlist Ert

Bernie

Playlist “Bernie”

Which version do you find better, “Bernie” or “Ert”? In addition, any further feedback is welcome.

@erogol Is it possible for you to post some other versions of Thorsten’s vocoder training between 925k and 2.75m? Maybe 1.5m and 2m, since the 2.75m version sounds worse, but a step in between might be even better than the already good 925k one :slight_smile:

I am really impressed by the Mycroft teaser

https://soundcloud.com/thorsten-mueller-395984278/sets

and Gothic TTS

https://soundcloud.com/sanjaesc-395770686/sets

I’ve uploaded them: https://drive.google.com/drive/folders/1ks0pijycnX_JfpFwXoWJGrWrvml7ajGz?usp=sharing

Sorry for the late reply.

Thanks @erogol, the 1.5m checkpoint is already a lot worse than the 925k one. Would it be possible for you to upload the updated events file for TensorBoard? I would take a look, and maybe we can guess which checkpoint might be best between 925k and 1.5m and get some hints for future trainings.

Following the evolution of the dataset (https://github.com/thorstenMueller/deep-learning-german-tts/blob/master/EvolutionOfThorstenDataset.pdf), I published a CSV file listing which recordings were done in which phase (the phases differ in quality level).
So if anyone wants to use only a specific phase, you can now look up which files belong to it.

See (chapter dataset evolution) here:
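To illustrate how such a phase listing could be used, here is a small sketch; the column names `filename` and `phase` are assumptions, so check the actual header of the published CSV:

```python
import csv
from io import StringIO

def files_for_phase(csv_text, phase):
    """Return the recordings that belong to a given dataset phase."""
    reader = csv.DictReader(StringIO(csv_text))
    return [row["filename"] for row in reader if row["phase"] == phase]

# Hypothetical sample mirroring the described structure.
sample = "filename,phase\na.wav,1\nb.wav,2\nc.wav,2\n"
print(files_for_phase(sample, "2"))  # ['b.wav', 'c.wav']
```

For the real file you would pass `open("thorsten_phases.csv")` (name hypothetical) instead of the in-memory sample.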

@dkreutz I turned on CUDA :slight_smile:

I also did some further tests using Mozilla TTS and various other TTS repos, the latter on an English dataset:

  1. Retesting the performance of Mozilla TTS on a PC with a quad-core i5 CPU with CUDA off: I was not aware that running inference again and again on different text inputs makes the RTF converge:

    sentence = "Nach dem Zusammenbruch des Römischen Reiches drangen Alamannen und Burgunden in das Gebiet ein."
    align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)
    (134144,)

    Run-time: 12.397961854934692
    Real-time factor: 2.037906188176561
    Time per step: 9.242207778774145e-05

  2. Testing Mozilla TTS on a Jetson Nano with CUDA on:

    sentence = "Nach dem Zusammenbruch des Römischen Reiches drangen Alamannen und Burgunden in das Gebiet ein."

    align, spec, stop_tokens, wav = tts(model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

    (134144,)

    Run-time: 8.102147102355957
    Real-time factor: 1.33133003504179
    Time per step: 6.037827702025876e-05
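The reported numbers relate to each other as follows: the real-time factor is the synthesis run time divided by the duration of the generated audio. A minimal sketch, assuming the model's 22050 Hz output sample rate (the `(134144,)` wav shape is a sample count):

```python
def rtf_stats(run_time_s, n_samples, sample_rate=22050):
    """Relate a synthesis run time to the duration of the produced audio."""
    audio_s = n_samples / sample_rate
    rtf = run_time_s / audio_s  # > 1.0 means slower than real time
    return audio_s, rtf

# The two run times quoted above, same 134144-sample output:
for run_time in (12.397961854934692, 8.102147102355957):
    audio_s, rtf = rtf_stats(run_time, 134144)
    print(f"audio: {audio_s:.2f} s, RTF: {rtf:.2f}")
```

So both runs synthesize roughly 6 seconds of audio; the Jetson Nano with CUDA takes about 1.33 seconds of compute per second of audio.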

So far I get the best results using FastSpeech(2) + a MelGAN vocoder, which runs at about real time on the Jetson Nano. On CPU there are some TTS models + MelGAN providing an RTF of about 0.25, which are unfortunately not performant on ARM + GPUs.

Looking forward to @erogol, @mrthorstenm and maybe @sanjaesc pushing the German TTS topic forward.

It is really interesting to see the comparison. Could you please clarify which models you used specifically? Is it Tacotron2 + MelGAN?

You can maybe try the Glow-TTS model, which is English-only at the moment. However, it would show you what is possible with that model with respect to run-time. It ought to be faster than real time.

Yes, I used Tacotron2 DDC + MelGAN for the comparison.

I will try out Glow TTS and report the outcome. Thank you for the hint @erogol.

I have issues running Glow TTS on my Jetson Nano because of

MAGMA library not found in compilation. Please rebuild with MAGMA.

On my PC the run-time is quite good, an RTF of about 0.25 without CUDA, but the produced audio quality is poor with this model in my opinion.

p.s. :s/pure/poor

You might want to look here: Nvidia-MAGMA and ICL/UTK-website

Is that an autocorrect/typo for “poor” where it says “pure”?

Thank you Neil, it was a typo.

I compiled MAGMA, which took several hours on the Jetson Nano, and added the MAGMA path to LD_LIBRARY_PATH, but it seems not to be picked up. Given the “poor” generated audio quality, I will not put further effort into this.
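For anyone retrying this, the library path would typically be exported like the following sketch; the install prefix `/usr/local/magma` is an assumption — adjust it to wherever you built MAGMA:

```shell
# Hypothetical MAGMA install prefix -- adjust to your build location.
MAGMA_HOME=/usr/local/magma

# Prepend the MAGMA library directory so the dynamic linker searches it
# first for processes started from this shell (e.g. Python/PyTorch).
export LD_LIBRARY_PATH="$MAGMA_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

echo "$LD_LIBRARY_PATH"
```

Note that the variable only affects processes launched from that shell session, and it cannot help if the PyTorch build itself was compiled without MAGMA support (which is what the error message suggests).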

Thanks @TheDayAfter for your research. Do I summarize this correctly when I say: Glow-TTS isn’t currently worth the effort based on the resulting audio output?

I suggest discussing this specific problem in the Nvidia Jetson forums.

Yes, but in general it’s a subjective topic :slight_smile: You can check, for instance:
GlowTTS-Colab