Contributing my German voice for TTS

And @mrthorstenm I’m hoping the above may be some help to you too :slightly_smiling_face:

Thanks @nmstoker. Of course your instructions are helpful for me too.
I provided a cleaned-up dataset to @dkreutz, who optimized the files with regard to random noise and echo. So while I’m recording new sentences, he is processing and analysing the dataset.

Training is currently around step 39k, and we have a few questions on interpreting the graphs (which are based on training step 20k).

Results from dataset analysis:
Bildschirmfoto 2020-02-13 um 21.12.36

Bildschirmfoto 2020-02-13 um 21.12.29

Bildschirmfoto 2020-02-13 um 21.12.22

Bildschirmfoto 2020-02-13 um 21.12.12(1)

  • Should we remove phrases longer than 125 from dataset?
  • Any ideas on the graphs?

Eval and training alignment graphs
Bildschirmfoto 2020-02-13 um 20.40.17

TrainingFigure graph looks “disrupted”. Is this okay?

Bildschirmfoto 2020-02-13 um 20.39.57

EvalFigures graph stops before reaching the upper right corner. Is this okay?

CheckDatasetSNR (signal-to-noise ratio)
Bildschirmfoto 2020-02-13 um 23.04.37(1)

A value of 100 should be best, so the dataset has 5,000 recordings with a great value.

General questions:

  • As far as I know, we have to start a new training run if we remove files from or add files to the dataset. Or can we modify the model after training is finished?

Should we remove phrases longer than 125 from dataset?

Assuming that those sentences have nothing wrong from a quality/consistency perspective, it might be better to keep them in the dataset and simply let the training code include/remove them based on the settings you use in config.json. This would give you more flexibility and you could easily compare a run that included longer sentences with one that didn’t, to see where the models match your needs best.

You’ll see at the start of training that it outputs details about the max and min length in the config and then shows how many sentences were excluded.
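As a rough illustration of that include/exclude step, here is a minimal sketch. It is not the actual training code: the function name is hypothetical, the parameter names `min_seq_len` and `max_seq_len` are assumed for illustration, and it assumes that the limit of 125 refers to characters of text.

```python
def filter_by_length(items, min_seq_len, max_seq_len):
    """Split (text, wav_path) pairs into kept and excluded lists by text length.

    Hypothetical helper: the real loader may measure length differently
    (for example on phonemes rather than on raw characters).
    """
    kept, excluded = [], []
    for text, wav_path in items:
        if min_seq_len <= len(text) <= max_seq_len:
            kept.append((text, wav_path))
        else:
            excluded.append((text, wav_path))
    return kept, excluded


# Made-up sample entries; the full dataset on disk stays untouched.
items = [
    ("Hallo.", "wavs/a.wav"),
    ("Guten Morgen, wie geht es dir heute?", "wavs/b.wav"),
    ("x" * 130, "wavs/c.wav"),
]
kept, excluded = filter_by_length(items, min_seq_len=10, max_seq_len=125)
print(f"kept {len(kept)}, excluded {len(excluded)}")
```

Because the filter is applied at load time, comparing a run with and without long sentences only needs a change to the limits, not a second copy of the dataset.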

I’m just on a break at work so will need to follow up on your other points later

I fail when running with the default LJSpeech dataset.
Since @dkreutz seems to get the identical error, I opened an issue on GitHub.

File "", line 76, in <module>
    model = SpeakerEncoder(**c.model)
TypeError: type object argument after ** must be a mapping, not str

All tips are welcome.

Hi @mrthorstenm - from what I can see on the GitHub issue, the link to the model in Google Drive that you say you’re using is for one of the TTS models (2nd to last Tacotron2 entry in that table on Released Models page), but actually what you need to use here is the Speaker-Encoder-iter25k model. It’s the one that @sanjaesc shows in the screenshot in their reply a little further up this thread.

Then you should be able to run (or at least we’ll be further along to getting it working for you :wink:)
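For anyone hitting the same traceback, here is a small sketch of how this kind of `TypeError` can arise. It assumes that a TTS-style config stores `model` as a plain string while a speaker-encoder config stores it as a mapping of parameters. The class and both config dictionaries below are made-up stand-ins, not the real ones.

```python
class SpeakerEncoder:
    """Hypothetical stand-in for the real model class."""

    def __init__(self, input_dim, proj_dim):
        self.input_dim = input_dim
        self.proj_dim = proj_dim


# Assumed shapes of the two kinds of config (values are illustrative only).
encoder_style_config = {"model": {"input_dim": 40, "proj_dim": 128}}
tts_style_config = {"model": "Tacotron2"}

# Unpacking a mapping works as expected.
encoder = SpeakerEncoder(**encoder_style_config["model"])

# Unpacking a string with ** raises the TypeError seen in the traceback.
try:
    SpeakerEncoder(**tts_style_config["model"])
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)
```

The exact wording of the message varies a little between Python versions, but it always states that the argument after `**` must be a mapping, not `str`.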

After chatting with @nmstoker I was able to compute embeddings on the released LibriTTS dataset.
I documented my lessons learned in the GitHub issue and closed it.

@dkreutz made a Bokeh plot of my LJSpeech-format dataset. Thanks for that :slightly_smiling_face: .
The two clusters on the left side might result from my having recorded most phrases (roughly 12k) with a USB microphone in two different rooms.
The smaller cluster on the right side (random sample) was recorded with better equipment (incl. a pop filter) in a smaller room. That covers roughly 3k phrases.

Bokeh overview

Bokeh detail left clusters

Bokeh detail right cluster

Any pro tips on the dataset before running training (again)?


It is a great job! Thx for keeping up everything in this thread.

The best solution is to record them again in the best format, but of course that is a lot of toil. So you might first train the model with the larger cluster and see how it performs. Then you can add the other clusters and see how the model behaves. If they reduce the performance, you unfortunately need to record them again.

You can also use denoising algorithms or neural models to clean up the broken clips. That might help.


You can also consider this problem as multi-speaker TTS, and train the TTS model conditioned on these embedding vectors. Then, if the model works fine, you can regenerate the poor clips with the TTS model by providing an embedding vector that matches a healthier recording (like the center of the larger cluster). This is something that might work, but I have never tried it.
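To make "the center of the larger cluster" concrete, here is a minimal sketch that averages the embedding vectors of one cluster. The function names and the three-dimensional sample vectors are made up for illustration (real speaker embeddings are much longer), and the final L2 normalization is an assumption that may or may not suit a given encoder.

```python
import math


def centroid(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length (assumed, optional step)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Made-up embeddings of clips from the "healthy" cluster.
good_cluster = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.9, 0.1, 0.0],
]

center = centroid(good_cluster)
target_embedding = l2_normalize(center)
print(target_embedding)
```

The resulting vector would then be the conditioning input when regenerating the poor clips, provided the multi-speaker model accepts an arbitrary embedding at inference time.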


Hi erogol,
Dominik here, I am the one working in the background with @mrthorstenm. Thanks for looking into this.

I had already thought of applying RNNoise to the audio clips, but I still have to figure out a good workflow (probably with sox and a LADSPA plugin?).

And you confirmed my idea of handling this as a multi-speaker “problem”. We will definitely follow this idea, and come back to you with many questions on how to do it :wink:


Just a short update.

I just made recording number 18,000, which equates to 17 hours of audio material :smiley:.
@dkreutz optimized the wav files and is currently training with a multi-speaker setup, as suggested as a “crazy idea” by @erogol .

When we are satisfied with the quality, the new and optimized dataset will be published/updated on Google Drive for use by the community.


Thanks to @mrthorstenm there are now some 3,000 more audio clips. I need some advice on how to extend a dataset which is already being used in training.
We are using the LJSpeech data format: do I simply copy the additional audio files to the folder “wavs”, paste the corresponding metadata at the end of metadata_(train|val).csv, and then continue training with the latest checkpoint?

I had always meant to try that and I understood it to be possible but I must admit I’ve never actually tried it.

The key thing is whether the initial caching of phonemes gets done when the fine-tuning option is selected. Am AFK right now, but it should be fairly easy to see in the code. If that did happen, then the steps you mention sound like they’d work.

Phoneme caching is a good point - haven’t thought of that!

Looking at datasets/ I understand that the phoneme file is automagically generated for a wav file if it does not exist. Looking at my phoneme-cache folder confirms this, as there are .npy files from different dates, from when I experimented with different datasets.


So here we go: added new wav-files and appended entries to metadata-train/val.csv. Then started training again with --continue_path option.

No errors so far. Startup message “Number of instances” sums up correctly to the new total of the training set.

Phoneme cache folder has new files where names match with wav-files that were added.
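For reference, the "append entries" step could be scripted roughly as below. This is only a sketch under assumptions: the helper name is hypothetical, the row layout `id|text|normalized text` follows the LJSpeech convention with the normalized column simply repeating the text, and the demo runs in a temporary folder with made-up file names.

```python
import os
import tempfile


def append_metadata(metadata_path, new_rows, wav_dir):
    """Append LJSpeech-style rows for clips whose wav file exists in wav_dir.

    new_rows is a list of (file_id, text) pairs. Returns the ids that were
    appended and the ids that were skipped because the wav file is missing.
    """
    # Avoid gluing the first new row onto an unterminated last line.
    needs_newline = False
    if os.path.exists(metadata_path) and os.path.getsize(metadata_path) > 0:
        with open(metadata_path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"

    appended, missing = [], []
    with open(metadata_path, "a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        for file_id, text in new_rows:
            if os.path.isfile(os.path.join(wav_dir, file_id + ".wav")):
                # Assumed layout: id|text|normalized text
                f.write(f"{file_id}|{text}|{text}\n")
                appended.append(file_id)
            else:
                missing.append(file_id)
    return appended, missing


# Demo in a throwaway folder with made-up ids.
with tempfile.TemporaryDirectory() as root:
    wav_dir = os.path.join(root, "wavs")
    os.makedirs(wav_dir)
    open(os.path.join(wav_dir, "new_0001.wav"), "wb").close()

    metadata_path = os.path.join(root, "metadata_train.csv")
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("old_0001|Guten Tag.|Guten Tag.")  # no trailing newline

    appended, missing = append_metadata(
        metadata_path,
        [("new_0001", "Hallo Welt."), ("new_0002", "Fehlt.")],
        wav_dir,
    )
    with open(metadata_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(appended, missing, len(lines))
```

Skipping rows without a matching wav file keeps the metadata and the “wavs” folder consistent, so the instance count printed at training startup stays easy to verify.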


I have now trained approx. 7k steps/28 epochs with the extended dataset. Alignment slowly improves, but the loss is increasing again (no new best_model.pth.tar since extending the dataset).

Is this a reason to worry, should I stop training?

Btw: what is the difference between parameters --continue_path and --restore_path? I have used “continue”, should I try “restore” instead?


Let it train, no worries. The decoder loss is going down.

--continue_path continues the training, using the same folder as the output path.

--restore_path restores the model but handles it as a new training run.


Training has reached 50k steps - time for another update and questions…

There was definitely an impact from adding audio files for (only) one of the speakers at 23k steps:

  • Loss values slowly but steadily increased
  • No more updates to best_model (because the average loss did not improve?)
  • StepTime increased
  • Memory consumption increased (from 11GB to now 18GB)

The audio example quality matches that of a previous run, but there are attention problems with longer sentences and with the German umlaut phonemes ä, ö, ü. I did not pay much attention to the data-cleaner, and some training phrases contain “foreign words” like “olé” that use characters which don’t exist in the German character set. Is that probably the reason for the latter problem?

It turned out that I made a dumb error while preprocessing the additional wav files: they were all messed up with a wrong sampling rate. Note to myself: always listen to the audio before you start training for several days…
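Besides listening, a quick automated check of the sampling rate can catch this kind of mistake before training starts. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM wav files. The helper names are hypothetical, and the expected rate of 22050 Hz is an assumption for the demo, not necessarily the rate used for this dataset.

```python
import os
import tempfile
import wave


def find_wrong_sample_rates(wav_paths, expected_rate):
    """Return {path: actual_rate} for every file that deviates from expected_rate."""
    wrong = {}
    for path in wav_paths:
        with wave.open(path, "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            wrong[path] = rate
    return wrong


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit file of silence (demo data only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo with two generated files, one of them at the wrong rate.
with tempfile.TemporaryDirectory() as root:
    good_path = os.path.join(root, "good.wav")
    bad_path = os.path.join(root, "bad.wav")
    write_silence(good_path, 22050)
    write_silence(bad_path, 44100)

    wrong = find_wrong_sample_rates([good_path, bad_path], expected_rate=22050)
    wrong_by_name = {os.path.basename(p): r for p, r in wrong.items()}

print(wrong_by_name)
```

A header check like this only confirms the declared rate. A file that was resampled incorrectly but labelled with the right rate would still need the listening test.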
After discussion with @mrthorstenm we will tackle the data-cleaner/symbols issue at the same time and start training from the beginning…


Fixed the wav-files and sorted out the data-cleaner/phoneme issue with help of @erogol. Fresh training session just started - I will report back when we see first results (or problems)…


Short update for statistics fans:

  • Phrases recorded: 18,036
  • Recorded audio length: 17 hours
  • Average sentence length: 47 chars
  • Chars per second (avg): 13.5
  • Sentences with question mark: 1,893
  • Sentences with exclamation mark: 1,462

The recordings are slower than my everyday speech, but in return they are clear and (hopefully) without any swallowed characters.
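For anyone who wants to compute similar statistics for their own dataset, here is a minimal sketch. The function name and the three sample clips are made up. It also assumes that chars per second is total characters divided by total duration, which may differ from how the figures above were derived (a mean of per-clip rates would give a slightly different number).

```python
def dataset_stats(clips):
    """Compute simple statistics from (text, duration_in_seconds) pairs."""
    total_chars = sum(len(text) for text, _ in clips)
    total_seconds = sum(duration for _, duration in clips)
    return {
        "phrases": len(clips),
        "hours": total_seconds / 3600,
        "avg_chars": total_chars / len(clips),
        # Assumed definition: overall characters per overall second.
        "chars_per_second": total_chars / total_seconds,
        "questions": sum("?" in text for text, _ in clips),
        "exclamations": sum("!" in text for text, _ in clips),
    }


# Made-up sample data.
clips = [
    ("Wie geht es dir?", 1.5),
    ("Das ist gut!", 1.0),
    ("Hallo Welt.", 1.0),
]
stats = dataset_stats(clips)
print(stats)
```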

3,700 phrases remain to be recorded. After that I will finish my recordings and upload/update the complete dataset for community use.


We reached epoch 340 / step 112,000, so it is time for an update.

You can clearly see where the gradual training with r=3 kicked in at 50k steps. Since then the loss values have been slowly increasing again.

The eval alignment was a (more or less) straight diagonal in between, but now it again has that gap in the upper right.
The Train alignment looks better though:

Here are some audio samples (806.4 KB)
Overall it is starting to sound good. There are some problems with the intonation of the German umlaut vowels; we probably need to enhance the dataset with more examples of those.

Any comments, any reasons to worry?