Contributing my German voice for TTS

Is there an updated version of the notebook somewhere? I get

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-22e1b9432cf3> in <module>
     26 
     27         mask = sequence_mask(text_lengths)
---> 28         mel_outputs, postnet_outputs, alignments, stop_tokens = model.forward(text_input, text_lengths, mel_input)
     29 
     30         # compute loss

ValueError: too many values to unpack (expected 4)

The forward function returns 6 values. Just add:

mel_outputs, postnet_outputs, alignments, stop_tokens, decoder_outputs_backward, alignments_backward = model.forward(text_input, text_lengths, mel_input, speaker_ids=speaker_ids)

Thanks, it worked :smiley: I had to remove speaker_ids=speaker_ids
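
For reference, here is a minimal sketch of the adjusted cell for the single-speaker case, using the names from the traceback above (so without the speaker_ids argument):

    # Adjusted notebook cell (single-speaker, so no speaker_ids).
    # forward() returns six values; only the first four are needed for the loss.
    mask = sequence_mask(text_lengths)
    (mel_outputs, postnet_outputs, alignments, stop_tokens,
     decoder_outputs_backward, alignments_backward) = model.forward(
         text_input, text_lengths, mel_input)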

Do you have any idea why I am getting a mismatch between the alignments and the wavs? The extraction was okay, but when I try to train I get

assert mel.shape[-1] * self.hop_len == audio.shape[-1], f' [!] {mel.shape[-1] * self.hop_len} vs {audio.shape[-1]}'
AssertionError:  [!] 104960 vs 104750

I did check the configs and everything looks the same… weird
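
For reference, this is roughly the check I used to confirm the mismatch; the hop_length value and file paths are placeholders from my setup, not the actual training code:

    import numpy as np
    import soundfile as sf

    hop_length = 256  # placeholder: must match "hop_length" in both configs

    mel = np.load("path/to/extracted_mel.npy")          # an extracted mel spectrogram
    audio, sr = sf.read("path/to/matching_audio.wav")   # the wav it was extracted from

    n_frames = mel.shape[-1]
    print(n_frames * hop_length, "vs", len(audio))
    # A result like "104960 vs 104750" means the wav is shorter than
    # n_frames * hop_length, e.g. because trimming/padding differed
    # between spectrogram extraction and vocoder training.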

Did you set trim silence to false?

Yes, disabled in both configs

maybe best to open a new issue on github then or ask in the issue you linked before.

I listened to the SpeedySpeech sample, but it doesn’t even read the whole thing right. And the voice quality is not better than Glow-TTS, at least to my ear.

BTW, if anyone is willing to take on SpeedySpeech, we can work on that together. It’d be a nice addition to the repo.

1 Like

One important thing to note.

Vocoder models are not fully optimized. I think there is a big performance gap that we can close with a good hyperparameter search.

Another option is to train the vocoder on the TTS model’s predictions instead of ground-truth spectrograms. In general I observe better results that way.
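
To illustrate what I mean: roughly, you run the trained TTS model in teacher-forced mode over the training set and save its postnet outputs as the vocoder’s input mels. This is only a rough sketch; the data loader and field names are placeholders, not the actual extraction script:

    import os
    import numpy as np
    import torch

    # Save teacher-forced mel predictions so the vocoder trains on what the
    # TTS model actually produces rather than on ground-truth spectrograms.
    os.makedirs("gta_mels", exist_ok=True)
    model.eval()
    with torch.no_grad():
        for batch in data_loader:                 # placeholder training-set loader
            text, text_lengths, mel, item_ids = batch
            outputs = model.forward(text, text_lengths, mel)
            postnet_mels = outputs[1]             # output after the postnet
            for mel_pred, item_id in zip(postnet_mels, item_ids):
                np.save(f"gta_mels/{item_id}.npy", mel_pred.cpu().numpy())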

You can find a list of about twenty audio samples generated by the SpeedySpeech model, and comparisons between SpeedySpeech + MelGAN vs Tacotron2 + MelGAN vs ground truth, here: https://janvainer.github.io/speedyspeech/

SpeedySpeech is definitely not perfect, but it looks like a good compromise between quality and performance, especially when you only have CPUs available. Glow-TTS sounds too unnatural and metallic to my ears.

Fine-tuning the models may improve both quality and performance.

Just a short update.

Due to the many possible combinations of TTS models and vocoders, I’m currently in discussion with @dkreutz and @othiele about whether we should write a roadmap on which combinations to try with the German "thorsten" dataset.

Personally, I’m in contact with @synesthesiam, who is kindly training GlowTTS + Multi-band MelGAN on the dataset (for Rhasspy and Home Assistant). He uploaded some "training in progress" samples (link can be found here: https://github.com/thorstenMueller/deep-learning-german-tts/issues/10).

Additionally, I’m currently trying to set up a WaveGrad vocoder training based on this repo: https://github.com/ivanvovk/WaveGrad

2 Likes

Good! This fork already has support for Mozilla TTS spectrograms :slight_smile: https://github.com/freds0/wavegrad You either train with GT spectrograms via the preprocessing script, or you save the spectrograms yourself and load them there.
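
For anyone trying this, a minimal sketch of saving Mozilla TTS-style mel spectrograms so the fork can load them. It assumes the AudioProcessor from Mozilla TTS and a plain-JSON config; the file paths are just examples:

    import json
    import numpy as np
    from TTS.utils.audio import AudioProcessor  # Mozilla TTS audio frontend

    # Use the "audio" section of the TTS config so the spectrogram parameters
    # match the ones the TTS model was trained with.
    with open("config.json") as f:               # example path to your TTS config
        audio_config = json.load(f)["audio"]

    ap = AudioProcessor(**audio_config)
    wav = ap.load_wav("example.wav")             # example input wav
    mel = ap.melspectrogram(wav)                 # shape: (num_mels, num_frames)
    np.save("example_mel.npy", mel)              # load this .npy in the WaveGrad fork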

3 Likes

Hello dear TTS-fellowers.

We can celebrate a birthday today, because the first post in this thread was written on November 5th, 2019 - so exactly one year ago :slight_smile: - and there’s no end in sight.

I just wanna say thank you to all of you guys (a list of everyone by name would probably be too long) who inspired, followed, motivated, helped and supported me on this journey.

So a huge round of applause for a nice community.

with all the best wishes for all of you
Thorsten

4 Likes

For those who want something pre-packaged, I have a GlowTTS/Multi-band MelGAN combo trained from @mrthorstenm’s dataset available here: https://github.com/rhasspy/de_larynx-thorsten

You can run a Docker container:

$ docker run -it -p 5002:5002 \
      --device /dev/snd:/dev/snd \
      rhasspy/larynx:de-thorsten-1

and then visit http://localhost:5002 for a test page. There’s a /api/tts endpoint available, and it even mimics the MaryTTS API (/process) so you can use it in any system that supports MaryTTS.
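
For example, a quick sanity check from Python (this assumes the endpoint accepts a text query parameter and returns WAV audio; adjust if your version differs):

    import requests

    # Request synthesized speech from the container started above.
    resp = requests.get(
        "http://localhost:5002/api/tts",
        params={"text": "Können Sie bitte langsamer sprechen?"},
    )
    resp.raise_for_status()
    with open("test.wav", "wb") as f:
        f.write(resp.content)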

If you happen to use Home Assistant, there’s a Hass.io add-on available as well :slight_smile:

4 Likes

Many thanks for this post, and it’s great to see Thorsten’s voice used in different projects, though I don’t like the results of Glow-TTS in general.

Your trained model has the typical issue with German umlauts, like other trained models; see for instance the umlaut issue.

Can I ask if you guys modified the phoneme_cleaner function to not perform transliteration, that is, converting special characters like ä to a and so on? It is strange that you have problems with umlaut characters. I also train on a language with special characters and have no problems with pronunciation :slight_smile: However, I got better results once I implemented a pronunciation dictionary and espeak-ng. Worth a try. :slight_smile:
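
What I mean is roughly this kind of cleaner; just a sketch in plain Python, the point being that there is no transliteration step (no unidecode / convert_to_ascii), so characters like ä, ö, ü and ß reach the phonemizer unchanged:

    import re

    def german_cleaner_keep_umlauts(text):
        """Sketch of a cleaner without transliteration: lowercase and
        collapse whitespace only, so umlauts are left untouched."""
        text = text.lower()
        text = re.sub(r"\s+", " ", text)
        return text.strip()

    print(german_cleaner_keep_umlauts("Können Sie bitte langsamer sprechen?"))
    # -> "können sie bitte langsamer sprechen?"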

1 Like

I use a separate tool called gruut that uses a pronunciation dictionary and grapheme-to-phoneme model to generate IPA. It correctly produces /k œ n ə n/ for können, so I’m not sure why the problem persists.
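
For reference, this is roughly how I query it; the sketch assumes the sentences() API of a recent gruut release with the German language package installed:

    from gruut import sentences

    # Print word-level phonemes for a German sentence.
    for sent in sentences("Können Sie bitte langsamer sprechen?", lang="de-de"):
        for word in sent:
            if word.phonemes:
                print(word.text, " ".join(word.phonemes))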

Maybe I’m just hitting a limitation of GlowTTS?

@synesthesiam and I discussed this weird umlaut issue here: https://github.com/thorstenMueller/deep-learning-german-tts/issues/10#issuecomment-716823273

In our small TTS group, @repodiac is the guy with the most experience in German phoneme cleaning - maybe he can help with this.

1 Like

Turns out my problem was the phoneme cleaners! I just switched it to “no_cleaners”, and it seems to have been fixed (to my untrained ear): https://github.com/rhasspy/de_larynx-thorsten/blob/master/samples/Können_Sie_bitte_langsamer_sprechen.wav

I’ve updated the Docker images :slight_smile:

3 Likes

Hi @synesthesiam.
Thanks for the update. The umlaut in “Können” is now pronounced much better and is easy to understand (to my German-trained ear :wink: )

1 Like