Contributing my german voice for tts

erogol · December 6, 2020, 3:15am

Try 50 iterations. It should be fine by now. In training it only uses 12 iterations, I guess for faster runtime.
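For anyone unsure what "iterations" means here: WaveGrad denoises in a fixed number of steps, and each step has its own noise value (the noise schedule). Below is a minimal sketch of a linear schedule with 12 and 50 steps. The function names and the endpoint values `1e-6` and `1e-2` are illustrative assumptions, not values confirmed in this thread.

```python
def linear_beta_schedule(num_steps, min_val=1e-6, max_val=1e-2):
    """Return num_steps noise values spaced evenly between min_val and max_val.

    The default endpoints are illustrative assumptions.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (max_val - min_val) / (num_steps - 1)
    return [min_val + i * step for i in range(num_steps)]


def noise_levels(betas):
    """Square root of the running product of (1 - beta), one value per step."""
    levels = []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        levels.append(alpha_bar ** 0.5)
    return levels


short_schedule = linear_beta_schedule(12)  # fewer steps, faster generation
long_schedule = linear_beta_schedule(50)   # more steps, finer denoising

print(len(short_schedule), len(long_schedule))
```

More steps mean proportionally more vocoder passes per utterance, so the 50-step schedule trades generation speed for finer denoising.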

TheDayAfter · December 9, 2020, 8:07pm

Great hint. The currently published models were trained for about 97k steps (Tacotron2) and 800k steps (MB-MelGAN). As you said, there is room for improvement.

P.S. Convergence is expected around 100k and 950k steps, respectively.

mrthorstenm · December 9, 2020, 10:12pm

After a short discussion in the Mycroft chat, I uploaded some new samples from the current WaveGrad training.

Some facts:

existing taco2 model of my dataset (460k) trained by @othiele
wavegrad model training currently running (right now at 350k steps)
tune_wavegrad still pending for getting noise schedule

I’ll keep the WaveGrad training running up to 500k steps and then pause it to run tune_wavegrad.

TheDayAfter · December 13, 2020, 8:33am

Nice job, I’m eager to see the final results. I am sure you will add examples of the same sentences to your overview page once finished, for direct comparison: https://thorstenmueller.github.io/deep-learning-german-tts/audio_compare

mrthorstenm · December 13, 2020, 12:04pm

Thanks @TheDayAfter.
Currently I’m playing around with WaveGrad and noise_schedule. Based on these vocoder test results, I’ll continue training either the WaveGrad or the Taco2 model.

If there’s something new to show, I’ll publish it on my comparison page.

mrthorstenm · December 14, 2020, 5:00pm

Thanks to @sanjaesc I was able to run tune_wavegrad for noise scheduling. There’s still work to do, but it’s getting slightly better.
I’ve uploaded some samples on my comparison page:

And a short one here:

Generation time (tested on cpu) isn’t really fast.

Run-time: 81.68155550956726
Real-time factor: 9.704083679886214
Time per step: 0.00044009456989066355
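The three numbers above are related by simple arithmetic. Here is a small sketch, assuming the real-time factor is run-time divided by audio length and that one "step" is one generated audio sample (neither definition is spelled out in this thread):

```python
# Values copied from the log output above.
run_time = 81.68155550956726            # seconds spent generating
real_time_factor = 9.704083679886214
time_per_step = 0.00044009456989066355  # seconds per step

# Assumption: RTF = run_time / audio_duration.
audio_seconds = run_time / real_time_factor

# Assumption: one step corresponds to one generated audio sample.
num_samples = run_time / time_per_step

# Samples per second of audio implied by the two assumptions.
implied_sample_rate = num_samples / audio_seconds

print(f"audio length: {audio_seconds:.2f} s")
print(f"samples: {num_samples:.0f}")
print(f"implied sample rate: {implied_sample_rate:.0f} Hz")
```

Under these assumptions the clip is roughly 8.4 seconds long and the implied sample rate is close to 22,050 Hz, so generation on CPU took almost ten times longer than the audio itself.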

What do you think should be next:
a) More training on taco2 model
b) More training on wavegrad vocoder

mrthorstenm · December 19, 2020, 9:45am

Hey guys.

Besides continuing the existing training efforts, I’ve been thinking for some time about going back to the microphone. Why, you ask?

Because of the paper: Exploring Transfer Learning for Low Resource Emotional TTS

For this, “emotional” recordings are required. If I understand it correctly, the following categories are needed:

Amused
Angry
Disgusted
Neutral
Sleepy

But I didn’t find any German phrases or corpus under a CC0 license that I could use. Before I start collecting sentences myself, I wanted to hear what you think.

So, what do you think about that?

mrthorstenm · December 21, 2020, 9:50am

As far as I can see, there’s no need for a special “emotional” corpus. I can reuse existing phrases and just pronounce them in an emotional way. This would surely be easier if the text is emotional by itself, but it will work with all phrases.

So the next step would be to take 300 random phrases and record them in four different emotions. I just need to borrow mic equipment again.
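The selection step could be sketched roughly like this. Everything named here is hypothetical: the placeholder corpus, the `build_recording_plan` helper, and in particular the choice of which four emotions to record (the sketch assumes the four non-neutral categories from the list above).

```python
import random

# Assumption: the four emotions are the non-neutral categories listed earlier.
EMOTIONS = ["amused", "angry", "disgusted", "sleepy"]


def build_recording_plan(phrases, num_phrases=300, seed=42):
    """Pick num_phrases distinct phrases and pair each with every emotion."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    chosen = rng.sample(phrases, num_phrases)
    return [(emotion, phrase) for phrase in chosen for emotion in EMOTIONS]


# Placeholder corpus; in practice this would be the existing dataset's phrases.
corpus = [f"Beispielsatz {i}" for i in range(1000)]

plan = build_recording_plan(corpus)
print(len(plan), "recordings planned")
print(plan[0])
```

With 300 phrases and four emotions this gives 1,200 recordings, which is a useful number to know before borrowing the equipment.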

TheDayAfter · December 21, 2020, 10:12am

This is interesting, as the user monatis (https://github.com/monatis) indicated that emotional speech, like that found in audiobooks, can generally have negative effects on the generated TTS. But I assume that putting emotions on top of a well-balanced non-emotional dataset might be a cool idea and worth the additional effort.

mrthorstenm · December 21, 2020, 11:15am

That’s true, and it’s described in a paper I’ve read on this topic. Actors or “professional voices” for audiobooks speak too emotionally. So I think it could work to read non-emotional phrases with a normal level of emotion.

I’ll start recording in mid-January and publish the first recordings to get feedback from you guys.

mrthorstenm · December 24, 2020, 10:13pm

Hey guys.

I wish all of you “merry christmas” and some relaxing days.

Best
Thorsten

mrthorstenm · January 4, 2021, 6:42pm

Hello.

I hope all of you had some relaxing days and a good (and healthy) start into 2021.

Just in case it’s interesting for you: our “thorsten” dataset is now available on openslr.org too.

http://openslr.org/95/

mrthorstenm · January 23, 2021, 5:53pm

Hello.
The nice guy @sanjaesc recently experimented with my dataset and sent me some samples with a HifiGAN vocoder, which I think are quite usable.

The breathing is nice, but it occurs too often, so I think it’s better without breathing at this speaking speed.

Soundcloud Playlist HifiGAN

What do you think?

Thanks for your great support

TheDayAfter · January 24, 2021, 9:02am

Hello @mrthorstenm, @sanjaesc Thank you for the update on HifiGAN.

Currently HifiGAN is my personal favorite on your vocoder comparison page with respect to the combination of inference speed and voice quality. I am waiting for the final results of @monatis regarding Multiband MelGAN.

Regarding the breathing: I like it in short sentences, as it gives your voice an additional natural touch. In long sentences it seems too much. I had never reflected on how many breaths we take when speaking.

p.s. Is the HifiGAN model already available somewhere so that we can “play” with it?

mrthorstenm · February 11, 2021, 6:16am

I’ve just added version 3 of my “thorsten” dataset. It’s based on v02, but the speed has been increased by 10%. TTS models trained on it will generate slightly faster (but still natural) speech.
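To get a feeling for what a 10% speed increase means for clip length, here is a tiny sketch. It assumes "10% faster" means the playback rate is multiplied by 1.10; how the speed-up was actually applied to the audio is not described here, and the helper name is made up for illustration.

```python
def duration_after_speedup(seconds, speedup=1.10):
    """Length of a clip after playing it `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return seconds / speedup


original = 11.0  # hypothetical clip length in seconds
faster = duration_after_speedup(original)
print(f"{original:.1f} s becomes {faster:.1f} s")
```

So a 10% speed increase shortens each clip by about 9%, not 10%.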

mrthorstenm · March 18, 2021, 6:28pm

Recording my emotional dataset is finished.

It took longer and was more difficult to pronounce non-emotional (or wrongly emotional) phrases in an emotional way, but it’s done.
Now @dkreutz is doing his audio optimization magic. Once he’s done, I’ll publish the “Thorsten emotional dataset”.

Always keep in mind that I’m no professional voice actor, just a normal guy contributing his voice.

Details and an audio sample can be found here:

mrthorstenm · March 23, 2021, 6:51pm

Just in case it’s interesting for you: I’ve created a Twitter account for my German voice contribution, where I plan to post new models, news or updates around the “Thorsten” dataset.

https://twitter.com/ThorstenVoice

mrthorstenm · April 3, 2021, 1:26pm

Hey guys.

@Erogol from Coqui released my first trained open German TTS model.

It consists of:

Tacotron2 DCA model (based on “Thorsten” dataset)
WaveGrad vocoder

The WaveGrad vocoder has a poor real-time factor (RTF) on CPU and an acceptable RTF on CUDA. Next, I’m training a Fullband-MelGAN vocoder to get a better RTF for use with the Mycroft voice assistant.

Want to give it a try?

pip install -U tts
tts --model_name tts_models/de/thorsten/tacotron2-DCA --text "Was geht, was geht, ich sags dir ganz konkret." --use_cuda=true

For updates on new models check my twitter account (https://twitter.com/ThorstenVoice).

Thank you all guys for your great support on this

Happy easter holidays

mrthorstenm · April 4, 2021, 11:12am

Hello.

I’ve just released my open German “emotional” dataset.
For details on the dataset, audio samples and the download link, visit my GitHub page:

https://twitter.com/ThorstenVoice

I hope it’s useful for someone and (as always) please keep in mind that I’m no professional voice talent, just a guy contributing his voice.

Wishing you nice easter holidays

mrthorstenm · April 23, 2021, 5:07am

My “open german voice dataset” now has an article on German Wikipedia.

It’s a great journey together with you guys