Contributing my German voice for TTS

Thank you Neil, it was a typo.

I compiled Magma, which took several hours on the Jetson Nano, and added the Magma path to LD_LIBRARY_PATH, but it seems not to be picked up. Given the "poor" quality of the generated audio, I won't put further effort into this.
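
For anyone hitting the same issue, this is what I mean by "not picked up"; a quick sanity check along these lines (the Magma install path is just a placeholder for wherever it was built, not necessarily where it lands on the Nano):

```python
# Sanity check: is the Magma lib directory actually on LD_LIBRARY_PATH, and does it
# contain the shared library? The install path below is a placeholder.
import os

magma_lib = "/usr/local/magma/lib"  # hypothetical install location
ld_path = os.environ.get("LD_LIBRARY_PATH", "")

print("on LD_LIBRARY_PATH:", magma_lib in ld_path.split(":"))
if os.path.isdir(magma_lib):
    print("libmagma files:", [f for f in os.listdir(magma_lib) if f.startswith("libmagma")])
else:
    print("directory does not exist:", magma_lib)
```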

Thanks @TheDayAfter for your research. Am I summarizing this correctly when I say: GlowTTS isn't currently worth the effort, based on the resulting audio output?

I suggest discussing this specific problem in the Nvidia Jetson forums.

Yes, but in general it's a subjective topic :slight_smile: You can check, for instance:
GlowTTS-Colab

A direct comparison between GlowTTS- and Speedyspeech-generated audio; inference speed on CPU is about the same for both:

GlowTTS

and

Speedyspeech
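
(The CPU timing I mention is nothing fancy, roughly the scaffold below; `synthesize` is only a placeholder for whichever model gets loaded:)

```python
# Rough CPU inference timing scaffold. `synthesize` is only a placeholder for the
# actual GlowTTS / Speedyspeech call; the timing part is the point here.
import time

def synthesize(text):
    """Placeholder for model inference (load the GlowTTS or Speedyspeech model here)."""
    return None

sentence = "The quick brown fox jumps over the lazy dog."
start = time.perf_counter()
synthesize(sentence)
print(f"inference took {time.perf_counter() - start:.2f}s on CPU")
```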

Does GlowTTS use MelGAN? It sounds good, except for the lower frequencies, which suffer as usual. I ran some tests with a higher fmin but did not notice any improvement.
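
For context, the relevant knob sits in the audio section of the Mozilla TTS config; a minimal sketch of what I mean by "higher fmin" (the numbers here are placeholder examples, not the values I tested):

```python
# Excerpt of the "audio" block of a Mozilla TTS config, written as a Python dict.
# mel_fmin / mel_fmax bound the mel filterbank; raising mel_fmin is the
# "higher fmin" experiment mentioned above. The numbers are placeholder examples.
audio_config = {
    "sample_rate": 22050,
    "num_mels": 80,
    "mel_fmin": 50.0,   # raise this (often 0.0 by default) to cut the lowest band
    "mel_fmax": 8000.0,
}
```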

I thought all *GAN vocoders produce a (more or less) metallic voice, as we encountered in our PWGAN tests.

If GlowTTS uses the MelGAN vocoder, I didn't hear a metallic voice there, so maybe we should try MelGAN too, instead of PWGAN.

Our samples can be found here:

The vocoder config of the Mozilla TTS Colab notebooks refers to "fullband-melgan".

MelGAN has big problems with breath effects; it sounds very jarring. It doesn't fare well with vocal fry either, which comes out very pronounced. I tried all the variants and couldn't get rid of it, although I have a slight suspicion the dataset is also to blame. I got much better results when I trained on the Nancy dataset, where the speaker did not glottalize at all when recording the utterances.

In my opinion, the GlowTTS-generated audio sounds more metallic than the one from Speedyspeech. Both use the same underlying LJSpeech dataset and MelGAN as the vocoder.

Maybe we (@dkreutz and @othiele) should train a MelGAN vocoder with the "thorsten" dataset (sounds a little crazy when I call it that :wink: ) to compare quality against PWGAN (metallic vs. breath problems).

Breathing is much better with PWGAN for me, but MelGAN sounds a bit more natural when the TTS is pronouncing utterances. Also, the spectrogram mismatches are much more exaggerated with PWGAN. I think it is a general problem with GANs and sadly cannot be avoided.

Did someone try training on ground truth aligned spectrograms from the TTS model?

I think so, someone on the repo did. I took a listen, but it was not very good, although the settings were a bit different, so I would be keen to try it with a less powerful discriminator; I want to train one more GAN. How do we extract the GT alignments?

Got a link?

You can use the notebook https://github.com/mozilla/TTS/blob/master/notebooks/ExtractTTSpectrogram.ipynb

I'll try, although I think the problem is that the spectrograms Taco2 produces at test time are different. Thanks for the notebook :slight_smile: Here is the thread: https://github.com/mozilla/TTS/issues/508

It seems like they didn't activate/train the discriminator at all, so I'm not sure the results say anything about the quality.

Not sure I understand ^^.

I thought the spectrograms Taco2 produces are inherently different from the ground truth and lack detail in comparison. Is that not the case? I assumed that was the source of the mismatch, not the actual ground truth.

Well, yes, the produced spectrograms will be different from the ground truth of the original data. But that's the point: to produce ground-truth spectrograms which are aligned to the TTS model, so the vocoder fits it better. That was my understanding, at least. :sweat_smile:
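
Roughly, the idea is something like the sketch below, assuming a teacher-forced pass of the Taco2 checkpoint over the training set (the helper here is only a stand-in for what the ExtractTTSpectrogram.ipynb notebook actually does; names and paths are hypothetical):

```python
# Sketch of ground-truth-aligned (GTA) spectrogram extraction for vocoder training.
# `teacher_forced_mel` is a stand-in for what the ExtractTTSpectrogram.ipynb
# notebook does with a real Taco2 checkpoint; the utterance list and paths are
# hypothetical.
import numpy as np
from pathlib import Path

def teacher_forced_mel(text, gt_mel):
    """Placeholder: run Tacotron2 with teacher forcing on the ground-truth mel,
    so the predicted frames stay time-aligned with the original audio."""
    return gt_mel  # stand-in; the real model returns its own (imperfect) prediction

out_dir = Path("gta_mels")
out_dir.mkdir(exist_ok=True)

utterances = []  # list of (utt_id, text, ground-truth mel) pairs from the dataset
for utt_id, text, gt_mel in utterances:
    pred_mel = teacher_forced_mel(text, gt_mel)
    np.save(out_dir / f"{utt_id}.npy", pred_mel)
    # The vocoder is then trained on (pred_mel, original wav) pairs, so it learns
    # the TTS model's spectrogram quirks instead of clean ground-truth mels.
```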