Contributing my German voice for TTS

Hello.

My name is Thorsten Müller. I’m a native German speaker and I currently use mimic-recording-studio to record my voice for TTS generation.
I’m using a corpus created by Mycroft community member gras64, with phrases taken from https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/de/sentence-collector.txt. So far I have recorded 7k phrases (out of 30k), with a total duration of roughly 6 hours.
I want to contribute this data in LJSpeech format (metadata.csv and wav files) to the community.
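
For anyone who wants to check such a dataset after downloading it, here is a minimal sketch that counts the clips and sums their duration. It assumes the usual LJSpeech layout (a pipe-separated `metadata.csv` whose first two columns are the file id and the text, plus a `wavs/` folder with one `<id>.wav` per row). The actual layout of this dataset may differ, and the function names are made up for illustration.

```python
import csv
import wave
from pathlib import Path


def read_metadata(path):
    """Return (file_id, text) tuples from a pipe-separated metadata.csv."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: quotation marks in the transcripts are plain text
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) >= 2:
                rows.append((row[0], row[1]))
    return rows


def wav_seconds(path):
    """Duration of one WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def dataset_hours(dataset_dir):
    """Return (number of clips, total duration in hours)."""
    root = Path(dataset_dir)
    rows = read_metadata(root / "metadata.csv")
    total = sum(wav_seconds(root / "wavs" / f"{file_id}.wav") for file_id, _ in rows)
    return len(rows), total / 3600.0
```

Calling `dataset_hours("path/to/dataset")` on the unpacked download should give numbers close to the phrase count and hours quoted in the post, provided the layout assumption holds.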

Information and download at: https://github.com/thorstenMueller/deep-learning-german-tts

Hopefully it’s useful for somebody.

Thorsten

3 Likes

That’s a great contribution, thanks. I’ll share some results and feedback ASAP.

1 Like

You’re welcome. Thanks for planning to share results once you have tested it.
I’m still recording at the moment and will update my wav files on Google Drive once I reach 10k recordings.

1 Like

Happy New Year, dear community :slightly_smiling_face:

Since I’m still recording my voice for community contribution (for several months now), I want to give a short update. I’ve recorded 12,600 phrases with a total audio length of 11 hours.


Direct dataset download: https://drive.google.com/open?id=1NTi-4r3EWl5dw0k2o4Xh92G0OHvhoxAJ

Results of analyze.py:

4 Likes

After training for 100k steps on 14,306 recorded phrases, I found that the quality was not as good as I had hoped. Dominik (@dkreutz) and Eltocino from the Mycroft forum were kind enough to check the quality of my recordings. It turned out that some recordings contain reverberation and echoes and are therefore not ideal for TTS training.
Together with Dominik, I am trying to identify and optimize the bad files. When that is complete, I will post the link to the cleaned and optimized dataset here.
Many thanks to Dominik and Eltocino for their support in this matter.

3 Likes

In a chat, @dkreutz recommended documenting the progress and lessons learned for the community. I think this is a good idea, so I (and probably Dominik) will update this thread on a regular basis.

At the moment I am listening to all my recordings and sorting the wavs into green (good), yellow (needs revision) and red (removed from dataset), while Dominik starts optimizing the files.

After removing the red ones and having Dominik optimize the yellow ones, we will hopefully have an acceptable dataset for German TTS generation.
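
The green/yellow/red triage could be applied to the metadata roughly like this. This is only a sketch: how the labels are stored is an assumption (here a plain dict from file id to colour), and treating unreviewed clips as "yellow" is my own choice, not something described in the post.

```python
def apply_review(rows, labels):
    """Split metadata rows by review label.

    rows:   list of (file_id, text) tuples from metadata.csv
    labels: dict mapping file_id -> "green" | "yellow" | "red" (hypothetical format)

    Returns (keep, revise): the rows that stay in the dataset and the ids
    that still need cleanup. Unreviewed clips count as "yellow".
    """
    keep, revise = [], []
    for file_id, text in rows:
        label = labels.get(file_id, "yellow")
        if label == "red":
            continue  # removed from the dataset
        if label == "yellow":
            revise.append(file_id)  # stays in, but needs revision
        keep.append((file_id, text))
    return keep, revise
```

Writing `keep` back out as a new `metadata.csv` leaves the original untouched, so a wrongly labelled clip can still be recovered later.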

I uploaded some recorded samples to my GitHub page (https://github.com/thorstenMueller/deep-learning-german-tts) so that interested people can get an impression of how my voice sounds.

Lessons learned so far:

  • Mimic-Recording-Studio (by Mycroft) records “RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz”, which is a higher sample rate than required (16000-22050 Hz). Stereo should not be needed.
  • Be aware of your recording room conditions (reverb and random noise)
  • Always keep some distance between mouth and mic
  • Use a good mic, and good speakers for reviewing your audio (this is mentioned in several places, so please take it seriously)
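
To see what format a recording actually has before training, something like the following sketch can help. It uses only Python's standard `wave` module. The 22050 Hz mono target is taken from the range mentioned above, and the helper names are hypothetical.

```python
import wave


def describe_wav(path):
    """Read the basic format of a PCM WAV file."""
    with wave.open(str(path), "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate": w.getframerate(),
            "bits": w.getsampwidth() * 8,
            "seconds": w.getnframes() / w.getframerate(),
        }


def needs_conversion(info, target_rate=22050):
    """True if the clip is not mono or not at the target sample rate."""
    return info["channels"] != 1 or info["sample_rate"] != target_rate
```

A file as described in the first bullet would report 2 channels, 44100 Hz and 16 bits, so `needs_conversion` would return `True` for it.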

Even though I use this thread to document the progress, it should not turn into a soliloquy, so feedback of every kind is welcome.

4 Likes

Great work @mrthorstenm - I look forward to hearing more on the techniques.

You might also be interested in the speaker embedding code in the dev branch here: https://github.com/mozilla/TTS/tree/dev/speaker_encoder

Using it with my own single-speaker dataset, I was able to identify a small but distinct cluster of my recordings which were more muffled than the majority. Curiously, it was picking up that I had a mild cold in some of the audio (which at the time I’d thought wasn’t audible, but it turned out to be). That was impacting the quality of the output audio from my trained model, and I saw an improvement when I pruned those recordings.
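
As a rough illustration of the idea (not the speaker_encoder code itself, and a simpler technique than inspecting clusters): assuming each clip already has an embedding vector, computed by the speaker encoder or any other model, the clips that sit farthest from the dataset average are good candidates for a listening check. The function names and the dict format are assumptions, and the vectors are assumed to be non-zero.

```python
import math


def centroid(vectors):
    """Element-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def farthest_clips(embeddings, top_k=10):
    """embeddings: dict file_id -> list of floats (precomputed elsewhere).

    Returns the top_k file ids farthest from the average embedding.
    """
    center = centroid(list(embeddings.values()))
    ranked = sorted(
        embeddings,
        key=lambda fid: cosine_distance(embeddings[fid], center),
        reverse=True,
    )
    return ranked[:top_k]
```

This only ranks clips for manual review. Whether a flagged clip is actually muffled still has to be judged by ear.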

1 Like

Thanks @nmstoker for the compliment and the link to speaker_encoder.
I’m gonna give it a try as soon as I have removed the obviously bad files from the dataset.

Even though there’s nothing new to show right now (no new dataset upload yet), I haven’t been idle on this topic and want to keep this thread updated.

  1. I removed the really bad files from the dataset
  2. I converted the files to mono and a 22 kHz sample rate (from stereo and 44 kHz)
  3. @dkreutz listened to the “yellow” files and is optimistic that he can fix most of them (reverb and random noise) whenever his rare spare time allows
  4. I borrowed/bought semi-professional, or at least better, microphone equipment and set up a better location for further recordings.
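
The conversion in step 2 could look roughly like the sketch below. It is not necessarily how the files were actually converted. It assumes 16-bit stereo input and simply averages each pair of stereo frames, which mixes the channels and halves the sample rate in one pass. That averaging is only a crude low-pass filter, so a dedicated resampler (for example in sox or ffmpeg) will usually give cleaner results.

```python
import array
import sys
import wave


def to_mono_half_rate(src, dst):
    """Convert a 16-bit stereo WAV to mono at half the sample rate."""
    with wave.open(str(src), "rb") as r:
        if r.getsampwidth() != 2 or r.getnchannels() != 2:
            raise ValueError("expected 16-bit stereo input")
        rate = r.getframerate()
        samples = array.array("h")
        samples.frombytes(r.readframes(r.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved L R L R ...; two stereo frames are four values.
    # Averaging all four yields one mono sample at half the original rate.
    out = array.array("h")
    for i in range(0, len(samples) - 3, 4):
        out.append(sum(samples[i:i + 4]) // 4)

    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(str(dst), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate // 2)
        w.writeframes(out.tobytes())
```

For a 44100 Hz stereo recording this writes a 22050 Hz mono file. Writing to a separate output path keeps the original recordings available in case a better conversion is wanted later.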

I’m continuing to record new phrases with this improved setup, so I hope that we (@dkreutz and I) can present a new, cleaner dataset in the future.

3 Likes