Contributing my German voice for TTS

georroussos · October 22, 2020, 3:03pm

Is there an updated version of the notebook somewhere? I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-22e1b9432cf3> in <module>
     26 
     27         mask = sequence_mask(text_lengths)
---> 28         mel_outputs, postnet_outputs, alignments, stop_tokens = model.forward(text_input, text_lengths, mel_input)
     29 
     30         # compute loss

ValueError: too many values to unpack (expected 4)

sanjaesc · October 22, 2020, 3:10pm

The forward function returns 6 values, so just change the line to:

mel_outputs, postnet_outputs, alignments, stop_tokens, decoder_outputs_backward, alignments_backward = model.forward(text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
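If the notebook has to run against more than one version of the model code, a small helper that keeps only the first four outputs avoids this unpacking error. This is a minimal sketch that assumes, as in the line above, that the first four return values keep their order (`mel_outputs, postnet_outputs, alignments, stop_tokens`) and that extra values such as the backward-decoder outputs are only appended at the end; the helper name `unpack_first_four` is hypothetical.

```python
from typing import Any, Sequence, Tuple


def unpack_first_four(outputs: Sequence[Any]) -> Tuple[Any, Any, Any, Any]:
    """Return (mel_outputs, postnet_outputs, alignments, stop_tokens).

    Hypothetical helper: assumes newer model versions only append extra
    values (e.g. decoder_outputs_backward, alignments_backward) after the
    first four, so old 4-tuples and new 6-tuples are handled alike.
    """
    if len(outputs) < 4:
        raise ValueError(f"expected at least 4 outputs, got {len(outputs)}")
    mel_outputs, postnet_outputs, alignments, stop_tokens = outputs[:4]
    return mel_outputs, postnet_outputs, alignments, stop_tokens


# Usage in the notebook (sketch):
# mel_outputs, postnet_outputs, alignments, stop_tokens = unpack_first_four(
#     model.forward(text_input, text_lengths, mel_input)
# )
```

The trade-off is that the backward-decoder outputs are silently discarded, which is fine for spectrogram extraction but not if you need them.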

georroussos · October 22, 2020, 5:12pm

Thanks, it worked. I had to remove speaker_ids=speaker_ids, though.

georroussos · October 22, 2020, 9:45pm

Do you have any idea why I am getting a mismatch between the alignments and the wavs? The extraction was okay, but when I try to train, I get:

assert mel.shape[-1] * self.hop_len == audio.shape[-1], f' [!] {mel.shape[-1] * self.hop_len} vs {audio.shape[-1]}'
AssertionError:  [!] 104960 vs 104750

I did check the configs and everything looks the same… weird
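The two numbers in the assertion can be checked with a little arithmetic. The sketch below assumes `hop_len = 256` (a common default, but the thread does not state the value) and an STFT computed with `center=True`, as in librosa's default, which yields `1 + num_samples // hop_len` frames for an even `n_fft`. Under those assumptions a 104750-sample clip produces 410 frames, and 410 * 256 = 104960, so the audio is 210 samples shorter than the mel frames imply. Padding or trimming the audio to `frames * hop_len` is one possible workaround; the function names are hypothetical.

```python
def centered_frame_count(num_samples: int, hop_len: int) -> int:
    # With center=True the signal is padded by n_fft // 2 on each side,
    # which for an even n_fft gives exactly this many STFT frames.
    return 1 + num_samples // hop_len


def padding_needed(num_samples: int, num_frames: int, hop_len: int) -> int:
    # Samples to append so that num_frames * hop_len == audio length;
    # a negative result means the audio would have to be trimmed instead.
    return num_frames * hop_len - num_samples


HOP_LEN = 256          # assumption: not stated in the thread
AUDIO_SAMPLES = 104750  # audio length from the assertion message

frames = centered_frame_count(AUDIO_SAMPLES, HOP_LEN)
print(frames, frames * HOP_LEN, padding_needed(AUDIO_SAMPLES, frames, HOP_LEN))
# 410 104960 210
```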

sanjaesc · October 23, 2020, 7:29am

Did you set trim silence to false?

georroussos · October 23, 2020, 9:17am

Yes, disabled in both configs

sanjaesc · October 23, 2020, 9:38am

Maybe it's best to open a new issue on GitHub then, or ask in the issue you linked before.

erogol · October 25, 2020, 1:21am

I listened to the SpeedySpeech sample, but it doesn't even read the whole text correctly. And the voice quality is not better than Glow-TTS, at least to my ear.

BTW, if anyone is willing to take on SpeedySpeech, we can work on it together. It'd be a nice addition to the repo.

erogol · October 25, 2020, 1:23am

One important thing to note.

Vocoder models are not fully optimized. I think there is a big performance gap that we can close with a good hyperparameter search.

Another option is to use the TTS model's predictions to train the vocoder. In general, I observe better results that way.

TheDayAfter · October 25, 2020, 8:29am

You can find about twenty audio samples generated by the SpeedySpeech model, with comparisons of SpeedySpeech + MelGAN vs. Tacotron2 + MelGAN vs. ground truth, here: https://janvainer.github.io/speedyspeech/

SpeedySpeech is definitely not perfect, but it looks like a good compromise between quality and performance, especially when only CPUs are available. Glow-TTS sounds too unnatural and metallic to my ears.

Fine-tuning the models may improve both quality and performance.

mrthorstenm · October 27, 2020, 7:21pm

Just a short update.

Due to the many possible combinations of TTS models and vocoders, I'm currently discussing with @dkreutz and @othiele whether we should write a roadmap of which combinations to try with the German “thorsten” dataset.

Personally, I'm in contact with @synesthesiam, who is kindly training GlowTTS + Multi-band MelGAN on the dataset (for Rhasspy and Home Assistant). He uploaded some “training in progress” samples (the link can be found here: https://github.com/thorstenMueller/deep-learning-german-tts/issues/10).

Additionally, I'm currently trying to set up a WaveGrad vocoder training based on this repo: https://github.com/ivanvovk/WaveGrad

georroussos · October 27, 2020, 9:45pm

Good! This fork already supports Mozilla TTS spectrograms (https://github.com/freds0/wavegrad): you can either train with GT spectrograms, or run the preprocessing script, save the spectrograms, and load them there.

mrthorstenm · November 5, 2020, 11:58am

Hello, dear TTS fellows.

We can celebrate a birthday today: the first post in this thread was written on November 5th, 2019, exactly one year ago, and there's no end in sight.

I just want to say thank you to all of you (a list of all names would probably be too long) who inspired, followed, motivated, helped, and supported me on this journey.

So a huge round of applause for a nice community.

With all the best wishes to all of you,
Thorsten

synesthesiam · November 13, 2020, 10:19pm

For those who want something pre-packaged, I have a GlowTTS/Multi-band MelGAN combo trained from @mrthorstenm’s dataset available here: https://github.com/rhasspy/de_larynx-thorsten

You can run a Docker container:

$ docker run -it -p 5002:5002 \
      --device /dev/snd:/dev/snd \
      rhasspy/larynx:de-thorsten-1

and then visit http://localhost:5002 for a test page. There's a /api/tts endpoint available, and it even mimics the MaryTTS API (/process), so you can use it with any system that supports MaryTTS.

If you happen to use Home Assistant, there's a Hass.io add-on available as well.
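For scripting against the container, the /api/tts and /process endpoints mentioned above can be called over plain HTTP. This is a sketch using only the Python standard library; it assumes `/api/tts` takes the sentence in a `text` query parameter (as the Mozilla TTS demo server does) and that `/process` accepts the usual MaryTTS parameters (`INPUT_TEXT`, `INPUT_TYPE`, `OUTPUT_TYPE`, `AUDIO`, `LOCALE`). Neither parameter set is confirmed in this thread, and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5002"


def api_tts_url(text: str, base_url: str = BASE_URL) -> str:
    # Assumption: the sentence is passed as the "text" query parameter.
    return f"{base_url}/api/tts?{urlencode({'text': text})}"


def marytts_process_url(text: str, locale: str = "de", base_url: str = BASE_URL) -> str:
    # Standard MaryTTS /process parameters; assumed to be honored by the container.
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    return f"{base_url}/process?{urlencode(params)}"


def download_wav(url: str, path: str) -> None:
    # Fetch the synthesized audio and write the raw response bytes to disk.
    with urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())


# Usage (requires the container to be running):
# download_wav(api_tts_url("Können Sie bitte langsamer sprechen?"), "test.wav")
```

`urlencode` percent-encodes the text as UTF-8, so umlauts such as "ö" are sent as `%C3%B6` rather than being dropped.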

TheDayAfter · November 14, 2020, 3:14pm

Many thanks for this post! It's great to see Thorsten's voice used in different projects, though I don't like the results of Glow-TTS in general.

Like other trained models, your model has the typical issue with German umlauts; see for instance the umlaut issue.

georroussos · November 14, 2020, 3:36pm

Can I ask whether you modified the phoneme_cleaner function so that it doesn't perform transliteration, i.e. converting special characters like ä to a and so on? It is strange that you have problems with umlaut characters. I also train on a language with special characters and have no problems with pronunciation. However, I got better results once I implemented a pronunciation dictionary and espeak-ng. It's worth a try.
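To see why a transliterating cleaner would break umlauts, compare one that folds text to ASCII with one that leaves the characters alone. Both functions below are hypothetical stand-ins, not the actual Mozilla TTS cleaners: the first uses Unicode NFKD decomposition plus an ASCII encode to drop diacritics (a common transliteration approach), and the second only lowercases and collapses whitespace.

```python
import re
import unicodedata


def ascii_folding_cleaner(text: str) -> str:
    # NFKD splits "ö" into "o" + U+0308 (combining diaeresis); encoding to
    # ASCII with errors="ignore" then drops the diaeresis, so "ö" becomes "o".
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", text.lower()).strip()


def umlaut_preserving_cleaner(text: str) -> str:
    # NFC keeps precomposed ä/ö/ü; only case and whitespace are normalized.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text.lower()).strip()


sentence = "Können  Sie bitte langsamer sprechen?"
print(ascii_folding_cleaner(sentence))      # konnen sie bitte langsamer sprechen?
print(umlaut_preserving_cleaner(sentence))  # können sie bitte langsamer sprechen?
```

The ASCII fold is even worse for "ß": it has no decomposition, so it is deleted outright ("Straße" becomes "strae").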

synesthesiam · November 14, 2020, 4:24pm

I use a separate tool called gruut that uses a pronunciation dictionary and grapheme-to-phoneme model to generate IPA. It correctly produces /k œ n ə n/ for können, so I’m not sure why the problem persists.

Maybe I’m just hitting a limitation of GlowTTS?
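The dictionary-first approach described in the last few posts can be sketched as a lookup with a fallback. The lexicon below is a toy with a single entry (the /k œ n ə n/ transcription quoted above), and the fallback is a crude hypothetical placeholder that emits one symbol per letter; a real setup would call a trained grapheme-to-phoneme model or espeak-ng instead. None of this is gruut's actual API.

```python
from typing import Callable, Dict, List

# Toy lexicon: only the entry quoted in the thread.
LEXICON: Dict[str, List[str]] = {
    "können": ["k", "œ", "n", "ə", "n"],
}


def letter_fallback(word: str) -> List[str]:
    # Hypothetical placeholder for a G2P model: one symbol per letter.
    return list(word)


def phonemize(
    sentence: str,
    lexicon: Dict[str, List[str]] = LEXICON,
    g2p: Callable[[str], List[str]] = letter_fallback,
) -> List[List[str]]:
    # Look each lowercased word up in the dictionary first and only fall
    # back to the G2P function for out-of-vocabulary words.
    words = sentence.lower().split()
    return [lexicon.get(word) or g2p(word) for word in words]


print(phonemize("Können Sie"))
# [['k', 'œ', 'n', 'ə', 'n'], ['s', 'i', 'e']]
```

Keeping the umlaut in the lookup key matters: if a cleaner had already folded "können" to "konnen", the dictionary entry would never be found and the word would go to the fallback instead.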

mrthorstenm · November 14, 2020, 4:41pm

@synesthesiam and I discussed this weird umlaut issue here: https://github.com/thorstenMueller/deep-learning-german-tts/issues/10#issuecomment-716823273

In our small TTS group, @repodiac is the one with the most experience in German phoneme cleaning; maybe he can help with this.

synesthesiam · November 15, 2020, 4:08pm

Turns out my problem was the phoneme cleaners! I just switched it to “no_cleaners”, and it seems to have been fixed (to my untrained ear): https://github.com/rhasspy/de_larynx-thorsten/blob/master/samples/Können_Sie_bitte_langsamer_sprechen.wav

I’ve updated the Docker images

mrthorstenm · November 15, 2020, 5:02pm

Hi @synesthesiam.
Thanks for the update. The umlaut in “Können” is now pronounced much better and is easy to understand (to my German-trained ear).
