Contributing my German voice for TTS

My German is very rusty, but it sounds like it’s coming along well. It would be interesting to hear it after the r=2 stage, which isn’t that far off (assuming you’ve left the gradual training stages as per those in config.json).

2 Likes

If my understanding of config.json is correct, we’ll be switching to r=2 at step 130k.
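
For anyone who wants to check when that switch happens, the schedule can be read straight out of config.json. A minimal sketch, assuming the Mozilla TTS layout where each “gradual_training” entry is [start_step, r, batch_size] (the file path is a placeholder):

import json

# Print the gradual-training schedule; in Mozilla TTS each entry is
# [start_step, r, batch_size]. Adjust the path to your checkout.
with open("config.json") as f:
    config = json.load(f)

for start_step, r, batch_size in config["gradual_training"]:
    print(f"from step {start_step}: r={r}, batch_size={batch_size}")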

@dkreutz Should I modify the sentence order in the corpus to record all sentences containing German umlauts next, so that we have all umlaut sentences in the next (appended) training run? Something like the sketch below could do the reordering.
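
A rough sketch only; it assumes the LJSpeech-style metadata.csv layout used here, and the file names are placeholders:

# Move all lines containing a German umlaut (or ß) to the front of the
# metadata file, keeping the original order within each group.
UMLAUTS = set("äöüÄÖÜß")

with open("metadata.csv", encoding="utf-8") as f:
    lines = f.readlines()

# sort() is stable, so the relative order of lines is otherwise preserved.
lines.sort(key=lambda line: not any(ch in UMLAUTS for ch in line))

with open("metadata_reordered.csv", "w", encoding="utf-8") as f:
    f.writelines(lines)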

BTW I don’t want to divert you if this is something that was already fixed but I see that problems with handling umlauts were mentioned here:

@mrthorstenm I see you’d commented on that issue, but I’m not sure how far through it you’d read, plus it was a little while back.

Thanks @nmstoker for the link.
Looking forward to taking a closer look.

Currently I’m trying to start a server with the “best_model.pth.tar” and config.json provided by @dkreutz for our speaker-embedded dataset.
Based on https://github.com/mozilla/TTS/tree/master/server I was able to create and install a TTS-0.0.1+df42a4a-py3-none-any.whl in the ./dist directory.
But running python -m TTS.server.server failed with the following output:

(venv) thorsten@thorsten-desktop:~/___dev/tts/mozilla/branch-dev/TTS$ python -m TTS.server.server
 > Loading TTS model ...
 | > model config:  /tmp/venv/lib/python3.6/site-packages/TTS/server/model/config.json
 | > checkpoint file:  /tmp/venv/lib/python3.6/site-packages/TTS/server/model/checkpoint.pth.tar
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:12.5
 | > frame_length_ms:50
 | > ref_level_db:20
 | > num_freq:1025
 | > power:1.5
 | > preemphasis:0.98
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:8000.0
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > sound_norm:False
 | > n_fft:2048
 | > hop_length:275
 | > win_length:1100
 > Using model: Tacotron2
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/venv/lib/python3.6/site-packages/TTS/server/server.py", line 39, in <module>
    synthesizer = Synthesizer(config)
  File "/tmp/venv/lib/python3.6/site-packages/TTS/server/synthesizer.py", line 31, in __init__
    self.config.use_cuda)
  File "/tmp/venv/lib/python3.6/site-packages/TTS/server/synthesizer.py", line 58, in load_tts
    self.tts_model.load_state_dict(cp['model'])
  File "/tmp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Tacotron2:
	Unexpected key(s) in state_dict: "speaker_embedding.weight". 
(venv) thorsten@thorsten-desktop:~/___dev/tts/mozilla/branch-dev/TTS$ 

I tried running python -m TTS.server.server using the following branches, without success:

  • master
  • dev
  • fix_server
  • fix_db7f3d3

I found the following “issue” on that, but I’m unsure whether it addresses my problem.

What could be the problem here?

@nmstoker Thanks for the pointer - sounds like I made the wrong decision by setting use_phonemes = true :upside_down_face:

I will keep training with the current settings for now and give it another try with use_phonemes = false, or when @mrthorstenm has finished recording his dataset.

I could “fix” the tts-server not starting, quick’n’dirty, by commenting out lines 819 - 822 in the file

/tmp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py

After that the server starts without problems.
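
Editing torch inside site-packages works, but a less invasive variant would be to change the load call in TTS itself. A sketch, not tested against this exact checkpoint; strict=False is a standard PyTorch option that skips mismatched keys instead of raising:

# In TTS/server/synthesizer.py (load_tts), replacing
#     self.tts_model.load_state_dict(cp['model'])
# with a non-strict load makes PyTorch ignore unexpected keys such as
# "speaker_embedding.weight" instead of raising a RuntimeError:
self.tts_model.load_state_dict(cp['model'], strict=False)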

Your examples sound quite good (and up to date :)).

Actually, you can probably already reduce the reduction rate r to 2, since the model looks saturated at r=3.

1 Like

“r” has already been reduced to “2” on reaching step 130k. The next reduction, to “1”, will happen at step 290k.
These are the newest graphs and sample wavs @dkreutz provided me yesterday, at step 167k.

Three audio samples:
Samples_167k.zip (358.2 KB)

1 Like

Today epoch 500 / step 185k was reached and there is no more improvement visible or audible. As there are still problems with umlaut pronunciation, I have stopped training for now.
I want to start a new training run with “use_phonemes=false”.
Any suggestions for modifying some of the other parameters, e.g. “use_forward_attn” or “bidirectional_decoder”? See the sketch below for the switches in question.
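
Toggling them before the new run would look roughly like this. A sketch only; the key names follow the Mozilla TTS config.json, and the values are one possible choice, not a recommendation:

import json

# Toggle the parameters in question for the next training run.
with open("config.json") as f:
    config = json.load(f)

config["use_phonemes"] = False           # try plain characters for the umlauts
config["use_forward_attn"] = True        # forward attention at inference time
config["bidirectional_decoder"] = False  # keep the plain decoder for now

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)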

2 Likes

Here is a sample from the tts-server loaded with the 185k model.

Vielen Dank an die Community von Mozilla und Meicroft.

Many thanks to the community of Mozilla and Mycroft, and of course @dkreutz, for the amazing support on this.

5 Likes

For those interested - here is a link to the 185k model

(Note to visitors from a far and distant future - the download link might have disappeared)

3 Likes

What is your opinion on the following non-technical aspect of this topic?
I have decided to donate my voice to the community so that (open source) projects have the opportunity to offer a German voice of usable quality in their project, free of charge (and offline), independent of the major cloud providers such as Amazon or Google.

I also see that several people record their voice for TTS but keep it private. Of course, I can understand this for data protection reasons. However, I ask myself the following questions:

  • Am I careless to donate my voice (incl. dataset) to the general public?
  • Do I lose control over part of my identity?
  • Should I be worried that my voice will be used for illegal activities?
  • Am I legally in a problematic situation by donating my voice?
  • Why don’t more people contribute their voice?

That doesn’t change the fact that I want to donate my voice to give something back to the community.

However, I would be interested in your opinion on these points.

2 Likes

Hi Thorsten,
I am grateful that you recorded your voice; it takes a lot of time and energy. Thanks.

Looking at advances in transfer learning, it will most likely be possible to clone a voice from just a few samples in the future, so we all have to worry about losing our identity, same as with images :slight_smile:

I don’t think you should be legally worried; your CC0 1.0 license should suffice. In German courts it is the intention that counts.

In time we’ll have a couple of different speaker sets for German and it is wonderful that we will be able to find differences between them and learn through that experience.

Thanks for your efforts
Olaf

1 Like

Hi Olaf.

Thanks for your feedback.

You’re right about the time and energy. I started recording six months ago and have been recording on a regular basis since then. By the time I’m finished with recording, it will probably be enough to record 10 phrases to clone a voice :upside_down_face: .

Whenever we are finished (me with recording and @dkreutz with Tacotron2 training), I’m excited to see the results and whether my model will be used by anyone else.

1 Like

@dkreutz I wanted to test it during my corona holidays. What config.json did you use to train the 185k model? Standard Tacotron?

Thanks in advance

Oh, and what revision did you train it on? I get the following error on the current master branch, indicating that you were using a different one :slight_smile:

RuntimeError: Error(s) in loading state_dict for Tacotron2:
	Missing key(s) in state_dict: "decoder.prenet.layers.0.bn.weight", "decoder.prenet.layers.0.bn.bias", "decoder.prenet.layers.0.bn.running_mean", "decoder.prenet.layers.0.bn.running_var", "decoder.prenet.layers.1.bn.weight", "decoder.prenet.layers.1.bn.bias", "decoder.prenet.layers.1.bn.running_mean", "decoder.prenet.layers.1.bn.running_var". 
	Unexpected key(s) in state_dict: "speaker_embedding.weight".

My config is based on the standard Tacotron2 model with some modifications for German phonemes and speaker embedding (see “crazy idea” above).
185k-config.zip (4.0 KB)

I am using a fork of TTS master from the end of January. Some minor changes were applied (datasets/preprocess.py, utils/text/symbols.py), but I haven’t found time to commit them to my repo yet.
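
For the symbols change, the idea is simply to make sure the German characters are part of the model’s alphabet. Roughly like this in utils/text/symbols.py; a sketch from memory, the exact variable layout in the fork may differ:

# utils/text/symbols.py (sketch): extend the character set so that the
# umlauts and ß survive text cleaning when use_phonemes is disabled.
_characters = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "äöüÄÖÜß"  # added for German
    "!'(),-.:;? "
)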

2 Likes

@dkreutz Thanks for that. I tried to use it but get the error

RuntimeError: size mismatch, m1: [1 x 560], m2: [80 x 256] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:290

which could mean I am on the wrong branch. I am using HEAD

e37503cb710bb229d8adc12a40d73338f1201351

Do you have any idea what I am doing wrong?

And what speaker_id are you using for inference?

What is your torch version? Mine is 1.3.0.

For the speaker_id I modified datasets/preprocess.py and cloned the LJSpeech function so that the last part of the dataset path is used as the speaker_id, e.g. my/data/path/speaker-1 will return speaker_id=“speaker-1”.

import os

def thorsten(root_path, meta_file):
    """Normalizes the Thorsten meta data file to TTS format.

    Cloned from the LJSpeech preprocessor; the speaker name is derived
    from the last component of the dataset path.
    """
    txt_file = os.path.join(root_path, meta_file)
    items = []
    # Use the last non-empty path component as the speaker name,
    # e.g. my/data/path/speaker-1 -> "speaker-1".
    ps = root_path.split("/")
    if ps[-1]:
        speaker_name = ps[-1]
    elif ps[-2]:  # handles a trailing slash in root_path
        speaker_name = ps[-2]
    else:
        speaker_name = "thorsten_1"
    with open(txt_file, 'r') as ttf:
        for line in ttf:
            # metadata format: <wav id>|<text>
            cols = line.split('|')
            wav_file = os.path.join(root_path, 'wavs', cols[0] + '.wav')
            text = cols[1]
            items.append([text, wav_file, speaker_name])
    return items
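
If I remember correctly, the preprocessor is picked by its name from the config, so a dataset entry named “thorsten” in config.json should make the trainer call this function. That is an assumption based on how the LJSpeech preprocessor is wired up, so double-check against the fork.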

I didn’t use synthesize.py or server.py for inference yet, only listened to the audio samples provided by Tensorboard. Maybe @mrthorstenm can help out here?

It looks like it is quite hard to do inference with the information given. I tried it, but only got junk out. Probably a wrong model generation on my side. Here is my notebook, if somebody wants to give it a go:

https://colab.research.google.com/drive/1yn44TbwvZmnoiWgJ13Q7Ad9j3LEOE1XC