Contributing my German voice for TTS


My name is Thorsten Müller. I’m a native German speaker and currently use Mimic-Recording-Studio to record my voice for TTS generation.
I’m taking phrases from a corpus created by Mycroft community member gras64 and have recorded 7k phrases (out of 30k) so far, with a total duration of roughly 6 hours.
I want to contribute this LJSpeech-formatted data (metadata.csv and wav files) to the community.

Information and download on:

Hopefully it’s useful for somebody.



That’s a great contribution, thanks. I’ll share some results and feedback as soon as possible.


You’re welcome, and thanks for planning to share results once you’ve tested with it.
I’m still recording at the moment and will update the wav files on Google Drive when I reach 10k recordings.


Happy New Year, dear community :slightly_smiling_face:.

Since I’m still recording my voice for the community contribution (for several months now), I want to give a short update: I’ve recorded 12,600 phrases with a total audio length of 11 hours.

Direct dataset download:

Results of


After training for 100k steps on 14,306 recorded phrases, I found that the quality was not as desired. Dominik (@dkreutz) and Eltonico from the Mycroft forum were kind enough to check the quality of my recordings. It turned out that some recordings had reverberation and echoes and were therefore not ideal for TTS training.
Together with Dominik I’m trying to identify and optimize the bad files. Once that is complete, I will post the link to the cleaned and optimized dataset here.
Many thanks to Dominik and Eltocino for their support in this matter.


While chatting with me, @dkreutz recommended documenting the progress and lessons learned for the community. As I think this is a good idea, I (and probably Dominik) will update this thread on a regular basis.

At the moment I’m listening to all my recordings and categorizing every wav as green (good), yellow (needs revision) or red (removed from dataset), while Dominik starts optimizing the files.

After removing the red ones and having Dominik optimize the yellow ones, we should hopefully have an acceptable dataset for German TTS generation.

I uploaded some recorded samples to my GitHub page, just for interested people to get an impression of the sound of my voice.

Lessons learned so far:

  • Mimic-Recording-Studio (by Mycroft) records as “RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz”, which is a higher sample rate than required (16000-22050 Hz); stereo should not be needed.
  • Beware of your recording room’s acoustics (reverb and random noise)
  • Always keep some distance between mouth and mic
  • Use a good mic and speakers for reviewing your audio (this is mentioned in several places, so please take it seriously)
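A quick stdlib check along these lines might help catch wrongly formatted recordings up front (just a sketch; the target values below are taken from the sample-rate bullet above):

```python
import wave

def check_wav_format(path, rate=22050, channels=1, sampwidth=2):
    """Return True if the wav file is mono, 16-bit PCM at the target rate."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == sampwidth)
```

Run over a whole dataset, this would flag, for example, the stereo 44100 Hz files that Mimic-Recording-Studio writes by default.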

Even though I’m using this thread to document the progress, it should not be a soliloquy, so feedback of every kind is welcome.


Great work @mrthorstenm - I look forward to hearing more on the techniques.

You might also be interested in the speaker embedding code in the dev branch here:

Using it with my own single-speaker dataset, I was able to identify a small but distinct cluster of my recordings which were more muffled than the majority. Curiously, it was picking up that I had a mild cold in some of the audio (which at the time I’d thought wasn’t audible, but it turned out to be). That was impacting the quality of the output audio from my trained model, and I saw an improvement when I pruned it.


Thanks @nmstoker for the compliment and the link to the speaker_encoder.
I’m going to give it a try as soon as I’ve removed the obviously bad files from the dataset.

Even though there’s nothing new to show right now (no new dataset upload yet), I haven’t been idle on this topic and want to keep this thread updated.

  1. I removed the really bad files from the dataset
  2. I converted the files to mono and a 22k sample rate (from stereo at a 44k sample rate)
  3. @dkreutz listened to the “yellow” files and is optimistic that he can optimize most of them (reverb and random noise) whenever his rare spare time allows it
  4. I borrowed/bought semi-professional, or at least better, microphone equipment and set up a better location for further recordings.

I’m continuing to record new phrases with the more optimized setup, so I hope that we (@dkreutz and I) can present a new, cleaner dataset in the future.


I’ve now recorded a further 2,000 phrases with the new/better equipment and room situation (still recording), and after 1,300 phrases I started a “quick” training run up to 72k steps.
The result doesn’t sound too bad, even if it’s still robotic. Due to upload restrictions here, I uploaded the sample wav file to my GitHub account.

But what’s wrong with my dataset that causes the horizontal line at 60?

I wanted to check the quality of my dataset with the tool by @nmstoker, but I’m struggling with this information:

shuf metadata.csv > metadata_shuf.csv
head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv

Do these values (12000 and 1100) come from the fact that this dataset consists of 13,100 audio clips?
So would these values for a 10,000-clip LJSpeech dataset be 8,900 (train) and 1,100 (validate), or what calculation are they based on?


Hi @mrthorstenm. That section where you’ve got the shuf, head and tail commands is for setting up the LJ Speech dataset for training with the main part of the repo. That shouldn’t be needed for using the Speaker Encoder notebook (although you certainly could use that dataset and look for similar audio samples).

You’re right that the 12,000 and 1,100 values come from the total size of LJ Speech, as those steps are effectively splitting the whole dataset into a training and a validation set.
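For illustration, the shuf/head/tail recipe from the commands quoted above could be written in Python like this (a sketch; only the 1,100-line hold-out is fixed, and the training slice is simply everything else):

```python
import random

# Equivalent of shuf + head/tail: 12,000 + 1,100 = 13,100, the full
# size of LJ Speech, so the two slices partition the shuffled file
# with no overlap.
def split_metadata(lines, n_val=1100):
    shuffled = list(lines)
    random.shuffle(shuffled)
    return shuffled[:-n_val], shuffled[-n_val:]  # (train, validation)
```

On a hypothetical 10,000-clip dataset the same recipe would indeed give 8,900 training and 1,100 validation lines.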

But if you simply want to look through your own audio, eg to find any unusual clusters, then you don’t need to split it up, just work on all the audio together. You can just feed it all the embeddings from your entire dataset and then the notebook will plot them in Bokeh.

To get the embeddings, have a look at (reading it again now, this could be a little clearer in the instructions :slightly_smiling_face:). You basically point it at your audio files and it creates a whole load of .npy files that are then used by the notebook.

I’m away from my computer right now, but let me know how you get on and if you’re stuck I’m happy to help try to figure it out.


Thanks for your clarification @nmstoker :slightly_smiling_face:.

Since I used the Mycroft Mimic2, I don’t have a best_model.pth.tar. When I tried running from the Mozilla TTS “speaker_encoder” folder, I ran into problems with umap.
Removing umap and installing umap-learn 0.3.10 fixed the issue for me.

So I’m currently running a training with the Mozilla version of

Am I right that a training run (up to which step would be wise?) is required for

Currently I get the following error when I try to run it:

python3 ./ model_path /home/thorsten/___dev/libri_tts/speaker_encoder/libritts_360-half-February-08-2020_10+27PM-e37503c/best_model.pth.tar config_path config.json 
Traceback (most recent call last):
  File "./", line 41, in <module>
    c = load_config(args.config_path)
  File "/home/thorsten/___dev/mozilla/TTS/tts_namespace/TTS/utils/", line 23, in load_config
    input_str =
  File "/usr/lib/python3.6/", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

But I’m going to look into this issue tomorrow.
Thank you, and good night for now :sleeping:


  • I use the dev branch of the repo.
  • Even when I enter a non-existent config file on the command line, the error stays the same.
python3 ./ model_path /home/thorsten/___dev/libri_tts/speaker_encoder/libritts_360-half-February-08-2020_10+27PM-e37503c/best_model.pth.tar config_path ./configUNKNOWN.json
Traceback (most recent call last):
  File "./", line 41, in <module>
    c = load_config(args.config_path)
  File "/home/thorsten/___dev/mozilla/TTS/tts_namespace/TTS/utils/", line 23, in load_config
    input_str =
  File "/usr/lib/python3.6/", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
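One possible explanation (an assumption on my part, not confirmed in this thread): if the script takes its arguments positionally, the literal tokens model_path and config_path on the command line are consumed as values and shift everything by one slot, so load_config ends up reading the binary .pth.tar file - which is exactly what a UnicodeDecodeError on byte 0x80 looks like, and would also explain why a non-existent config filename changes nothing. A minimal argparse sketch of the effect:

```python
import argparse

# Hypothetical positional signature: model, config, dataset, output.
parser = argparse.ArgumentParser()
for name in ("model_path", "config_path", "data_path", "output_path"):
    parser.add_argument(name)

# The literal words "model_path"/"config_path" are parsed as values,
# so config_path silently receives the binary model file:
args = parser.parse_args(
    ["model_path", "best_model.pth.tar", "config_path", "config.json"]
)
print(args.config_path)  # → best_model.pth.tar
```

If that is the cause, dropping the literal model_path/config_path tokens and passing just the paths in order should fix it.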

Edit: sorry, I don’t know how but earlier I think I misunderstood what you were doing (so what I wrote before probably wasn’t helpful :slightly_frowning_face: sorry!)

It might be easiest to have a go using it with the LibriTTS 360 or 100 files (available here: NB: the 360 set is 27 GB!) just so you can be sure you’ve recreated what’s in the notebook, and only then swap in your files.

I don’t think you’d need to do the training step - you can try the pretrained model “as is” first, just to confirm that you can get the embeddings, and then use the notebook. Roughly it should be something like this:

  1. Start with the pretrained model
  2. Use it to create the embeddings for LibriTTS files
  3. Plot these using the notebook (PlotUmapLibriTTS.ipynb)
  4. Go back to step 2 to create embeddings for your audio and try step 3
  5. Explore to see how the audio is grouped by similarity - in my case I had one main cluster, with the next biggest cluster being an offshoot that had muffled audio.
  6. If there aren’t any discernible patterns to your clusters with your audio, then it is possible that it would need to be re-trained (but that wasn’t something I needed to do myself)
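As a rough numpy-only complement to step 5 (a sketch, not the notebook’s UMAP/Bokeh plot): embeddings that sit far from the dataset centroid are often the muffled or otherwise unusual recordings worth auditioning first.

```python
import numpy as np

def flag_outliers(embeddings, threshold=2.0):
    """Return indices of embeddings whose distance to the dataset
    centroid exceeds threshold times the mean distance."""
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    return np.where(dists > threshold * dists.mean())[0]

# Synthetic demo: 100 "normal" 256-dim embeddings plus a small
# shifted cluster standing in for muffled recordings.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(100, 256))
odd = rng.normal(5.0, 1.0, size=(3, 256))
idx = flag_outliers(np.vstack([normal, odd]))
print(idx)  # the last three rows stand out
```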

BTW: you are correct to install umap-learn

I’m not sure about the UTF-8 errors, though.


It might also be worth having a look at, too, as that enables similar clustering.

What this version in TTS does is use the pretrained model Eren created along with the interactive plotting, where you can listen to the audio by clicking on the chart (making it easy to discover what connects the clusters).

Thanks for your responses.

The confusion is my fault, since my code snippets show the string “libritts_360” - but that’s just the name from the default config.yaml.
I’m using an LJSpeech dataset. But you’re right: I will check my setup with a pretrained model first, just to be sure all libs and dependencies are correct, before trying to verify my own LJSpeech dataset.


Dear Neil,
I wonder if you could be as helpful for Italian as you have been here for German.
I am trying to replicate the steps you suggested,
so I downloaded the LibriTTS 100 version (English) - OK

  1. Start with the pretrained model
  2. Use it to create the embeddings for LibriTTS files
    Here I start having some problems understanding your suggestion.
    I have read the Multi Speaker Embeddings page,
    then I went to
    and tried to follow the instructions there, like:

Download a pretrained model from Released Models page.

To run the code, you need to follow the same flow as in TTS.

  • Define ‘config.json’ for your needs. Note that, audio parameters should match your TTS model.
  • Example training call python speaker_encoder/ --config_path speaker_encoder/config.json --data_path ~/Data/Libri-TTS/train-clean-360
  • Generate embedding vectors python speaker_encoder/ --use_cuda true /model/path/best_model.pth.tar model/config/path/config.json dataset/path/ output_path . This code parses all .wav files at the given dataset path and generates the same folder structure under the output path with the generated embedding files.
  • Watch training on Tensorboard as in TTS
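The “same folder structure” behaviour described in the last bullet can be pictured with this small sketch (the helper name and the paths are made up for illustration):

```python
import os

def embedding_output_path(wav_path, dataset_root, output_root):
    """Map a dataset wav file to its mirrored .npy embedding path."""
    rel = os.path.relpath(wav_path, dataset_root)
    base, _ = os.path.splitext(rel)
    return os.path.join(output_root, base + ".npy")

print(embedding_output_path("/data/spk1/a.wav", "/data", "/emb"))
# → /emb/spk1/a.npy (on a POSIX path layout)
```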

As steps 1 and 2 are the starting point you suggest, could you give some more detailed guidance so that I can proceed?



Adding to my previous post, here is an additional piece of usage info coming from:

TTS has a subproject, called Speaker Encoder.
It is an implementation of . There is also a released model trained on LibriTTS dataset with ~1000 speakers in Released Models page.

Ok, so I go to the Released Models page and try to choose the right one.
But which one would you suggest?
I would go for Tacotron-iter-170K, described as “More stable and longer trained model.”
But if I understand correctly, this is for LJSpeech - a different dataset than the LibriTTS you suggest using, whose “corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts”.

LJSpeech instead is a “speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.”.

So, in the end, can you help clarify this inconsistency?
Or am I, as a newcomer here, mixing things up between the TTS/speaker_encoder GitHub instructions and those in the wiki?

Thanks for your help.

Ok, so I go to the Released-Models page and I try to choose the good one.
But where or which one is the one you would suggest?

There is a pre-trained Speaker Encoder Model in the middle.




Thanks, and sorry for the inconvenience.
As posted elsewhere, I am trying to set up a first Italian-language attempt, starting from the voices collected in the Mozilla Common Voice project.
From what I’ve read around here, due to the number of different voices and the variation in quality we should not have too high expectations.
So I will also start a single-Italian-voice attempt.
Thanks again.

Hi @smg, I hope I can help but it would be good to confirm your objective with the speaker embeddings here as there are some different ways it can be used.

You mention that you’re trying to use the various Italian speakers from Common Voice. Are you trying to produce a) a multi-speaker model as per issue #166, or b) a single TTS voice trained on all the audio from a range of different speakers?

If a): I haven’t looked at multi speaker models myself so I would have to defer to the expertise of others, but the impression I got from reading #166 is that it’s not able to produce such clear voices (yet)

If b): that’s not something I’ve tried, but with my recordings of a single speaker (me!) I’ve seen that including audio where I hadn’t spoken in a really consistent style had a big impact on output voice quality, so I think it would be unlikely to work well at all where the speakers are different people.

The part I expect I can help with is if you’re looking at using the speaker embeddings tools to help analyse the audio you’re planning to use for training a TTS model.

If that’s what you want then I’d be trying to guide you to recreate what’s shown in the video for the multi speaker audio or for a single speaker.

To do that you’d need the pretrained model from Released Models. It is the one that @sanjaesc showed.

You’d want to go to the /speaker_encoder directory and adjust the paths in config.json so they point at your audio for the datasets path and at a suitable location for the output_path (this is where you’ll be saving the embeddings, which are .npy files corresponding to each of your audio files). You can also just pass these paths as parameters, so I’m not sure how strictly necessary it is to change them in config.json. Also, to clarify: the contents of the config.json in that directory are currently the same as the config.json that comes with the model, but you’d want to use the one supplied with the model you’re using if they were to differ (for instance if you’d trained a speaker embedding model with different parameters).

Then run it, passing the relevant arguments: the use_cuda parameter, then the model you downloaded above (as per @sanjaesc), the config and the paths. This works through your audio files and produces the .npy files I mentioned above.

Then go into the Notebook directory and launch the Jupyter notebook. It’ll need some basic edits to point the paths at your files, but this should be fairly obvious (e.g. MODEL_RUN_PATH and the others must point to somewhere on your computer). Depending on where you stored your embeddings, you may also need to edit the glob command for embed_files.
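Those notebook edits might look roughly like this (all paths here are made-up placeholders; MODEL_RUN_PATH and embed_files follow the notebook’s naming, the other names are assumptions):

```python
import glob
import os

# Point the notebook at your own run directory (hypothetical paths):
MODEL_RUN_PATH = "/home/me/models/speaker_encoder_run/"
MODEL_PATH = os.path.join(MODEL_RUN_PATH, "best_model.pth.tar")
CONFIG_PATH = os.path.join(MODEL_RUN_PATH, "config.json")

# Adjust the glob so it matches wherever the .npy embeddings were
# written (recursive, to cover mirrored speaker subfolders):
embed_files = glob.glob("/home/me/embeddings/**/*.npy", recursive=True)
print(len(embed_files))
```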

Also, under “Set up the embeddings”, comment/uncomment the relevant part for single-speaker or multi-speaker use.

Finally you should be in a position to step through the Notebook :slightly_smiling_face:

If all goes smoothly you should see the Bokeh chart near the end of the notebook. Don’t forget to run the final cell, which starts a local server so that the hyperlinks on each plotted embedding point to the corresponding audio file (this is how you can click on the various sections of the chart and hear what the corresponding audio sounds like, as shown in the video).

So, that’s quite a few steps! I hope I’ve got it all right and been clear. Have a go, and if you run into any problems you can’t figure out, report back with as much useful detail as you can and I’ll do my best to assist :+1: