Contributing my German voice for TTS

Hi Thorsten, thanks for your updates.

I’ve been training a Belgian Dutch multi-speaker model for 170k iterations, until it stopped improving. The model is okay for some sentences but not consistent enough to be usable, so, inspired by you, I’m considering creating a single-speaker database as well.

I have some questions:

  • How long does it take to record one hour of audio? What is the longest recording session you’ve had so far?
  • Why does the result of your model deviate so strongly from the sample provided here? The quality of your dataset is similar to that of LJSpeech, or am I mistaken?

Finally, do you have any other tips to share regarding the creation of the dataset?

1 Like

Welcome @rdh.

I’m happy that I could inspire you to record your own dataset.

Recording time for one hour of pure audio isn’t easy to estimate since it depends on several factors: average phrase length, recording method (which tool), and post-processing tasks (optimizing, silence trimming, …).

I recorded (and still record) phrases between 2 and 180 chars. The average length is currently 50 chars, and I record about 14 chars per second. My longest recording sessions have been around 60 minutes without a break, but mostly they are around 30 minutes at most.
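
The figures above allow a rough back-of-the-envelope estimate of how long one hour of final audio takes to record (a sketch; the 50-char average and 14 chars/sec are from this post, while the overhead factor for retakes, pauses, and tool handling is an assumed value, not a measured one):

```python
# Rough estimate of wall-clock time needed to record one hour of pure audio,
# based on the figures mentioned above.
AVG_CHARS_PER_PHRASE = 50
CHARS_PER_SECOND = 14          # reading speed while recording
OVERHEAD_FACTOR = 2.5          # assumption: retakes, pauses, tool handling

seconds_per_phrase = AVG_CHARS_PER_PHRASE / CHARS_PER_SECOND
phrases_per_audio_hour = 3600 / seconds_per_phrase
wall_clock_hours = OVERHEAD_FACTOR * 1.0  # per hour of finished audio

print(f"~{phrases_per_audio_hour:.0f} phrases per hour of audio")
print(f"~{wall_clock_hours:.1f} hours at the microphone (with overhead)")
```

So with these numbers, one hour of usable audio is roughly a thousand average-length phrases, spread over several sessions.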

Additional helpful information can be found here:

3 Likes

Hey @mrthorstenm,

Thanks so much for your work and the interesting discussion here!

Please let me know if you need any GPU help training Tacotron-2. I would be happy to help out by sharing pre-trained models and experimenting with this repo: https://github.com/Rayhane-mamah/Tacotron-2.

1 Like

New month - new update.
I’ve recorded 1.1k new phrases with a length between 100 and 180 chars. Further recordings will be done based on needed phonemes, if necessary.
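
Choosing new phrases by needed phonemes can be sketched as a simple coverage score (a toy sketch; real phoneme sequences would come from a phonemizer such as espeak-ng, and the scoring function below is my own illustrative assumption, not how the thorsten dataset is actually curated):

```python
from collections import Counter

def coverage(corpus_phonemes):
    """Count how often each phoneme occurs across already-recorded phrases."""
    counts = Counter()
    for phones in corpus_phonemes:
        counts.update(phones)
    return counts

def score_candidate(phones, counts):
    """Higher score = the phrase contributes more rare or missing phonemes."""
    return sum(1.0 / (1 + counts[p]) for p in set(phones))

# Toy example: in practice the phoneme sequences come from a G2P tool.
recorded = [["a", "b", "a"], ["a", "c"]]
counts = coverage(recorded)
candidates = {"phrase1": ["a", "b"], "phrase2": ["d", "e"]}
best = max(candidates, key=lambda k: score_candidate(candidates[k], counts))
print(best)  # the candidate covering unseen phonemes scores highest
```

Ranking candidates this way lets you record the phrases that fill phoneme gaps first instead of reading text in arbitrary order.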

After the recordings I wrote a document on the evolution of this dataset, including several graphs from the dataset analysis notebooks. If you’re interested, feel free to take a look:
Evolution of thorsten dataset.pdf (507,0 KB)

Currently our group (internally named after a “Lord of the Rings” quote: “The fellowship of the … free German TTS model”) has started new training tests.

And as always - I’ll keep you updated :smile:.

I wish all of you a nice weekend.

6 Likes

How did you install espeak-ng? I am running Ubuntu 18.04 and have been trying for hours now to no avail. I have to install from source if I want to add custom words, right? I am following the guide and I keep getting `espeak-ng: symbol lookup error: espeak-ng: undefined symbol: espeak_ng_SetVoiceByFile`

Hi George,

I’m on Arch, and initially I had installed it from the package here: https://www.archlinux.org/packages/community/x86_64/espeak-ng/ but when I decided I wanted to try customising certain words, I went ahead and installed it from source.

It’s been a while since I did it, but the instructions I followed were those written by Josh Meyer, here: http://jrmeyer.github.io/tts/2016/07/03/How-to-Add-a-Language-to-eSpeak-NG.html with some reference to the repo itself. As I understand it, the repo’s instructions are based on what Josh had written up, so they’re pretty similar.

There’s someone else who posted a similar issue here: https://groups.io/g/espeak-ng/message/2637 and they also posted it in the discussion at the bottom of Josh’s instructions page mentioned above. Unfortunately they don’t list any resolution.

I see you also posted an issue in the repo. I doubt this will solve it but there are some useful pointers in this issue: https://github.com/espeak-ng/espeak-ng/issues/662 which may help rule things out.

Ultimately the problem looks like it’s connected to a change made to this file: https://github.com/espeak-ng/espeak-ng/blob/master/src/include/espeak-ng/espeak_ng.h
back in early 2019, when they added the symbol that’s complained about in the error message (`espeak_ng_SetVoiceByFile`).

As a short-term check I suppose you could try installing from the version of the source just prior to the commit where they added that symbol. That would give reassurance that it was that change at fault (or it would establish that it’s just a coincidence, but I doubt it, given the direct mention of the symbol name).
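
One quick way to check which libespeak-ng the dynamic linker actually resolves, and whether it exports that symbol, is to probe it with Python’s ctypes (a sketch; the library name and the `libespeak-ng.so.1` fallback are assumptions and may differ on your system, especially if you installed into `/usr/local/lib`):

```python
import ctypes
import ctypes.util

def symbol_present(libname, symbol):
    """Return True if `symbol` is exported by the shared library `libname`."""
    lib = ctypes.CDLL(libname)
    # ctypes raises AttributeError for symbols the library does not export.
    return hasattr(lib, symbol)

if __name__ == "__main__":
    # Assumed names; adjust the path if you built espeak-ng from source.
    name = ctypes.util.find_library("espeak-ng") or "libespeak-ng.so.1"
    try:
        print(name, symbol_present(name, "espeak_ng_SetVoiceByFile"))
    except OSError as err:
        print("could not load", name, "-", err)
```

If this prints `False`, the loader is picking up a pre-2019 library (e.g. the distro package) instead of your freshly built one, which would match the symbol lookup error above.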

Kind regards,
Neil

Cheers Neil, thank you so much :grinning: You are always so kind and helpful. I spent the entire evening trying to get it to work without success, but I just got it to run on my Mac and I am actually very happy, because it works.

My problem is that Swedish has a lot of compound words, and I guess espeak hasn’t had a Swede work on the rules, so compounds are always mispronounced. Then I saw you guys discussing dictionaries (an idea I initially wanted to do away with, because I worked on them so much when I did concatenative TTS and I am sick of them :joy:), but now it looks like it is the only solution, because I really do not want to train a char-based TTS - I suspect it would perform poorly, and I really want the phonemes for out-of-domain words. I tried adding one word now and it worked.

I am also lucky because a few years ago the National Library of Oslo acquired a pronunciation dictionary that is open source, so I get to use it :smiley: Now I will get my hands dirty and try to fix espeak-ng on Ubuntu too, so I can train with it.

1 Like

Hello again @mrthorstenm! Is there any way I could email you about some questions? I’m working on my master thesis about applying prosody controls to Tacotron-2 and I’d love to get in touch and discuss potentially using your dataset for experimentation. Thank you and all the best! :slight_smile:

Lately I have been working on improving TTS performance on compound and unseen words, since it is hit or miss, especially because you cannot dictate the stress and it is entirely up to Tacotron what it learns as a linguistic feature. One of the problems I had was that, in short sentences (1 or 2 words) containing unseen words, the stopnet sometimes tripped. That was also the case with compound words. I found that incorporating a pronunciation lexicon improves pronunciation massively and helps with the stopnet. My guess is that a large pronunciation lexicon covering a big portion of the words is consistent in its phonemic transcriptions, so training the TTS on those phoneme sequences is much easier for it, whereas when left to guess it might predict different phonemes for compound words (because it has not seen them) and trip.
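
A lexicon-first phonemization along those lines might look roughly like this (a sketch; the function and the dummy fallback are illustrative assumptions, not Mozilla TTS internals - in practice the fallback would be a G2P tool such as an espeak-ng wrapper):

```python
def phonemize(text, lexicon, fallback):
    """Look each word up in the pronunciation lexicon; only guess for OOV words.

    `lexicon` maps lower-cased words to phoneme strings; `fallback` is any
    grapheme-to-phoneme function used for out-of-vocabulary words such as
    unseen compounds.
    """
    phones = []
    for word in text.lower().split():
        phones.append(lexicon.get(word) or fallback(word))
    return " ".join(phones)

# Toy example with a two-entry lexicon and a dummy fallback.
lex = {"hund": "h ʊ n t", "haus": "h aʊ s"}
print(phonemize("Hund Haus Hundehaus", lex, fallback=lambda w: f"<g2p:{w}>"))
```

The key point is that every in-lexicon word always gets the same transcription, so the model only has to cope with guessed phonemes for the genuinely unseen remainder.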

1 Like

Do you have an example of a TTS snippet from your voice? It would be nice to know how Mozilla TTS works for the German language. I am new to Mozilla TTS and currently exploring the status quo.
Best regards from Frankfurt

Hello @fabianbusch.
Currently there’s no model (or samples) available since our Tacotron2 training is still running and we’re fine-tuning several parameters to figure out the best configuration.

The “thorsten” dataset is now freely available for German TTS training.
See my GitHub page for dataset details and the download URL.

Please read the “special thanks” section on GitHub for a list of the great supporters of this project. It’s a pleasure to work with you guys on a free German TTS model.

6 Likes

This speech corpus seems very, very good; great job! Only wish there were corpora like this one for all other Germanic languages :smiley:

1 Like

I posted an article on why I’ve chosen to contribute my voice.

4 Likes

@mrthorstenm I started DDC training with your dataset. So far samples look quite good. I’ll share the model once it is finished. And probably I’ll share a recipe too.

2 Likes

Thanks @erogol.
Your support on this is very welcome :slight_smile:

Hello.

German umlauts and phoneme cleaner issues:
@erogol we sometimes had problems with German umlauts in model training before using the German phoneme cleaner by @repodiac. Maybe it’s worth a look if you encounter umlaut problems too.

TTS recipe
I wrote a shell script for starting training, including pre-tasks, and @repodiac wrapped it into a Docker image.

The model in training produces very quiet results. Have you experienced this before with your models?

Now I’m starting to train it with `do_sound_norm: true` to normalize the sound level. Hopefully that will mitigate the problem.
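
In spirit, `do_sound_norm` amounts to a per-clip level normalization, which can be sketched like this (an illustrative sketch, not the actual Mozilla TTS implementation; the target level is an assumed value):

```python
import numpy as np

def normalize_volume(wav, target_dbfs=-27.0):
    """Scale a float waveform so its RMS level hits `target_dbfs` (assumed target)."""
    rms = np.sqrt(np.mean(wav ** 2))
    if rms == 0:
        return wav  # silence: nothing to scale
    target_rms = 10 ** (target_dbfs / 20)
    return wav * (target_rms / rms)

# Two "recordings" at very different levels end up at the same RMS afterwards.
loud = 0.5 * np.sin(np.linspace(0, 100, 16000))
quiet = 0.05 * np.sin(np.linspace(0, 100, 16000))
for clip in (loud, quiet):
    out = normalize_volume(clip)
    print(round(20 * np.log10(np.sqrt(np.mean(out ** 2))), 1))
```

The idea is simply that clips recorded at different loudness all reach the same average level before training, so the model never has to learn volume as a feature.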

Nice that you have the shell script. Please send a PR and we’ll take a look at it together.

We know that some recordings are slightly louder than others, but we thought this would be normalized during training - so setting this option to true seems to make sense.

@dkreutz sent me a sample in the past which starts with “normal” volume and decreases to a lower level in the second part.
derkleineprinz.zip (492,4 KB)

Maybe @dkreutz can help on this volume issue.