Inspired by the TTS audio sample comparison pages from @edresson1 and @erogol, I published some audio samples from our group (still some work to do) on a simple webpage.
This is great, thank you! I forgot to add this to my page: my GlowTTS model was trained for 380K steps and the vocoder for 500K steps.
Also, I think sample 5 for the ParallelWaveGAN does not match the text.
Hi @synesthesiam.
Thanks for your nice feedback. I added details on training steps and audio sample #5 to my site.
Additionally, I changed the text from “Anfang vom Froschkönig” (translated: the beginning of a German fairy tale) to the actual spoken phrase.
Thanks for your feedback - it makes much more sense now.
Just for general information.
User monatis from the TensorSpeech/TensorFlowTTS repo is training a model based on my public dataset.
Notebook, details and (work in progress) samples can be found here.
I’ve added some samples on my vocoder comparison page.
Thank you Thorsten.
I gave de_larynx-thorsten a try but was not happy with the final result. At least I would not use it as a German TTS backend for myself. I don't know if it's GlowTTS or MelGAN in general. On the other hand, I like that it includes a TTS server by default, which turns out to be very useful.
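In case anyone else wants to try the bundled server, here's a minimal sketch of a quick test, assuming it behaves like the Mozilla TTS demo server (HTTP GET on port 5002 with an /api/tts endpoint; the port and route are assumptions, so check the server's startup output for the real ones):

```python
# Minimal sketch: query the bundled TTS server for a WAV file.
# Port 5002 and the /api/tts route are assumptions based on the
# Mozilla TTS demo server; verify against the actual startup log.
import requests

resp = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": "Das ist ein Test."},
)
resp.raise_for_status()
with open("test.wav", "wb") as f:
    f.write(resp.content)
```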
We’re aware that there is a lot of room for quality improvements, but we’re still in training mode.
@dkreutz is training a VocGAN model and I'm training a WaveGrad model (thanks to great support from @sanjaesc). We'll see if the quality improves over the previous models.
Hey guys.
Thanks to @nmstoker, I tried the dev TensorBoard and am sharing the current WaveGrad training on the “thorsten” dataset.
The audio still has random background noise, but I hope it will lessen with more training. Or is this a good point to use tune_wavegrad.py?
Try 50 iterations. It should be fine by now. In training it only uses 12 iterations, I guess for faster runtime.
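The schedule math itself is simple. Here's a standalone sketch of the standard WaveGrad noise level for 50 iterations (just the paper's schedule math, not the repo's exact code):

```python
# Standard WaveGrad schedule math (not the repo's exact code):
# a linear beta schedule and the cumulative noise level derived from it.
import numpy as np

num_iters = 50                                  # try 50 instead of ~12
beta = np.linspace(1e-6, 1e-2, num_iters)       # per-step noise variance
noise_level = np.sqrt(np.cumprod(1.0 - beta))   # what the model conditions on
print(noise_level[0], noise_level[-1])          # ~1.0 down to ~0.88
```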
Great hint. The currently published models are at about 97K steps for Tacotron2 and 800K steps for MB-MelGAN. As you said, there is room for improvement.
P.S. Convergence is expected around 100K and 950K steps respectively.
After a short discussion in the Mycroft chat, I uploaded some new samples of the current WaveGrad training.
Some facts:
- existing Tacotron2 model of my dataset (460K steps), trained by @othiele
- WaveGrad model training currently running (right now at 350K steps)
- running tune_wavegrad to get the noise schedule is still pending
I'll keep the WaveGrad training running up to 500K steps and then pause it to run tune_wavegrad.
Nice job, eager to see the final results. I am sure you will add same-sentence examples to your overview page once finished, for direct comparison: https://thorstenmueller.github.io/deep-learning-german-tts/audio_compare
Thanks @TheDayAfter.
Currently I'm playing around with WaveGrad and the noise schedule. Based on these vocoder test results, I'll continue WaveGrad or Tacotron2 model training.
If there's something new to show, I'll publish it on my comparison page.
Thanks to @sanjaesc, I was able to run tune_wavegrad for noise scheduling. There's still work to do, but it's getting slightly better.
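Conceptually, the tuning is just a search over candidate schedules, keeping the one that reconstructs held-out audio best. A toy sketch of the idea (not the actual tune_wavegrad.py implementation; evaluate() here is only a stand-in):

```python
# Toy sketch of noise-schedule tuning, NOT the actual tune_wavegrad.py:
# grid-search candidate beta ranges and keep the lowest-error schedule.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def evaluate(beta):
    # Stand-in for the real step: synthesize held-out mels with this
    # schedule and measure the error against ground-truth waveforms.
    return rng.random()

best_err, best_beta = np.inf, None
for lo, hi, steps in itertools.product([1e-6, 1e-5], [1e-2, 1e-1], [25, 50]):
    beta = np.linspace(lo, hi, steps)   # candidate noise schedule
    err = evaluate(beta)
    if err < best_err:
        best_err, best_beta = err, beta
print(f"best error: {best_err:.3f} with {len(best_beta)} steps")
```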
I’ve uploaded some samples on my comparison page:
And a short one here:
Generation time (tested on CPU) isn't really fast:
Run-time: 81.68155550956726
Real-time factor: 9.704083679886214
Time per step: 0.00044009456989066355
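For context, the real-time factor is just wall-clock generation time divided by the duration of the generated audio, so the numbers above imply roughly 8.4 seconds of synthesized audio:

```python
# Real-time factor = generation time / audio duration.
# Values taken from the log above.
run_time = 81.68155550956726    # seconds of wall-clock generation time
rtf = 9.704083679886214         # reported real-time factor
audio_seconds = run_time / rtf  # ~8.4 s of synthesized audio
print(f"{audio_seconds:.2f} s of audio at RTF {rtf:.2f}")
```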
What do you think should be next:
a) More training on taco2 model
b) More training on wavegrad vocoder
Hey guys.
Besides continuing the existing training efforts, I've been thinking for some time about going back to the microphone. Why, you ask?
Because of the paper: Exploring Transfer Learning for Low Resource Emotional TTS
For this, “emotional” recordings are required. If I understand it right, the following categories are needed:
- Amused
- Angry
- Disgusted
- Neutral
- Sleepy
But I didn't find any German phrases or corpus under a CC0 license that I could use. Before I start collecting sentences myself, I wanted to hear your opinions.
So, what do you think?
If I see it right, there's no need for a special “emotional” corpus. I can reuse existing phrases and just pronounce them in an emotional way. This would surely be easier if the text itself were emotional, but it should work with any phrases.
So the next step would be to take 300 random phrases and record them in four different emotions. I just need to borrow the mic equipment again.
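Selecting the phrases could be as simple as this sketch (I'm assuming the dataset keeps an LJSpeech-style metadata.csv with pipe-separated "id|text|..." rows; the filename and layout are assumptions):

```python
# Sketch: pick 300 random phrases from an LJSpeech-style metadata.csv
# ("id|text|..." rows separated by pipes -- the layout is an assumption)
# and write one recording list per emotion; neutral is assumed to be
# covered by the existing recordings.
import csv
import random

random.seed(42)  # reproducible selection

with open("metadata.csv", encoding="utf-8") as f:
    phrases = [row[1] for row in csv.reader(f, delimiter="|")]

selection = random.sample(phrases, 300)  # needs >= 300 rows in the file
for emotion in ("amused", "angry", "disgusted", "sleepy"):
    with open(f"phrases_{emotion}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(selection))
```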
This is interesting, as the user monatis from https://github.com/monatis indicated that emotional speech like that found in audiobooks can generally have negative effects on the generated TTS. But I assume that putting emotions on top of a well-balanced non-emotional dataset might be a cool idea and worth the additional effort.
That's true, and it's described in a paper I've read on this topic. Actors or “professional voices” for audiobooks speak too emotionally. So I think it could work to read non-emotional phrases with a normal level of emotion.
I'll start recording in mid-January and publish the first recordings to get feedback from you guys.
Hey guys.
I wish all of you a “Merry Christmas” and some relaxing days.
Best
Thorsten
Hello.
I hope all of you had some relaxing days and a good (and healthy) start to 2021.
Just in case it's interesting for you: our “thorsten” dataset is now available on openslr.org too.
Hello.
The nice guy @sanjaesc recently experimented with my dataset and sent me some samples with a HiFiGAN vocoder, which I think are quite usable.
The breathing is nice, but it happens too often, so I think it's better without breathing at this speaking speed.
What do you think?
Thanks for your great support!