Train Multispeaker Dataset + WaveRNN

Hello everyone, I created a custom German dataset by extracting audio files from a game called Gothic. I've successfully trained a single-speaker model using the repo from fatchord. Here you can hear samples from the main hero of the game.

Any tips on how to go about training a multi-speaker model? Would I have to split every speaker into a separate folder? Is there an entry point where I could start reading about this topic?
I'm also interested in training a vocoder. Would it be possible to train a universal vocoder specifically for this dataset?

Thanks for the repository and all the work you put into it.
Best regards.


Sounds great. Is the game State of Mind? (Just curious)

TTS should already support multi-speaker training. The best way is to format your dataset like LibriTTS and use the formatter that is already there.
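To illustrate the LibriTTS-style layout mentioned above, here is a minimal sketch: one folder per speaker, with each utterance stored as a wav plus a matching transcript file. The directory names, the `.normalized.txt` suffix, and the function name are illustrative assumptions, not the repo's exact formatter API.

```python
# Assumed layout (LibriTTS-style):
#   dataset/
#     speaker_01/
#       chapter_01/
#         utt_0001.wav
#         utt_0001.normalized.txt
#     speaker_02/
#       ...
import os

def libritts_like_formatter(root_path):
    """Collect (text, wav_path, speaker_name) tuples from a LibriTTS-style tree."""
    items = []
    for speaker in sorted(os.listdir(root_path)):
        speaker_dir = os.path.join(root_path, speaker)
        if not os.path.isdir(speaker_dir):
            continue
        for dirpath, _dirnames, filenames in os.walk(speaker_dir):
            for fname in sorted(filenames):
                if not fname.endswith(".wav"):
                    continue
                wav_path = os.path.join(dirpath, fname)
                txt_path = wav_path.replace(".wav", ".normalized.txt")
                if not os.path.isfile(txt_path):
                    continue  # skip clips without a transcript
                with open(txt_path, encoding="utf-8") as f:
                    text = f.read().strip()
                items.append((text, wav_path, speaker))
    return items
```

With this layout, the speaker name falls straight out of the folder structure, which is what makes the multi-speaker bookkeeping easy.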

Regarding the vocoder: we already trained and released a universal vocoder in the WaveRNN repo. It works nicely across different speakers and languages.


The game series is called Gothic.

I’m currently in my exam period, so I haven’t had much time for experimenting.

I’ve trained a multi-speaker Tacotron model with GST to 130k steps, and it had learned decent attention by then. I’ll include more speakers and fully train the model once I have more time. Passing a style_wav during inference also changed the prosody of the generated speech. I’m still not sure how to work with the GST tokens, but I’ll read more about it once the exams are over.

I haven’t tried the universal vocoder yet, but my dataset has a sample rate of 22050 Hz. How would I go about fine-tuning the existing model?

I’ll post the results later on.

Thanks and best regards.


I’m training on 10 speakers, using GST.

You can listen to some test sentences here. Link
Without a style_wav the speech is way too fast.

I mostly used the default config.json from the master branch, also in the link above. Any tips on the config? Should I tune some parameters?

At ~200k steps I accidentally stopped the training and had to resume it; I’m not sure if that affected it negatively.

I’ll let it run for a little bit longer.


Another Update:

Here are some more examples from the multi-speaker model, generated at around ~320k steps. I also got the warning “> Decoder stopped with ‘max_decoder_steps’”, but it fixed itself after some iterations.

On another note: if I want to extract the spectrograms from the TTS model, how would I go about doing so? The notebook ExtractTTSpectrogram.ipynb has the following lines.

# TODO: multiple speaker
model = setup_model(num_chars, num_speakers=0, c=C)

Any tips on what changes are needed to extract mels from a multi-speaker model?
Help on this matter would be highly appreciated.

Best regards.

Thanks for sharing the results. It looks quite interesting, and given the figures, your training looks quite healthy.

For using the notebook, you need to give the model the right number of speakers, and you need to provide the list of speakers. The speakers are already exported to a JSON file; you need to load it and pass the list to the model as well. If you look at how the source code does inference, it is quite easy to implement.


You might also consider setting the audio parameters, because male and female voices generally require different values. You can try the CheckSpectrograms notebook to optimize these values.
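The kind of check the CheckSpectrograms notebook encourages can be sketched as: compute spectrograms for a handful of utterances and inspect their value range, so the normalization bounds in config.json match the data. The helper below only does the statistics part; `AP` in the usage comment stands for the repo's audio processor and is an assumption.

```python
# Hedged sketch of a spectrogram range check (illustrative helper,
# not part of the repo).
import numpy as np

def spec_stats(mels):
    """Given a list of 2-D spectrogram arrays, report global min/max/mean."""
    mins = [m.min() for m in mels]
    maxs = [m.max() for m in mels]
    return {
        "min": float(np.min(mins)),
        "max": float(np.max(maxs)),
        "mean": float(np.mean([m.mean() for m in mels])),
    }

# Usage (assuming AP is the repo's audio processor):
# mels = [AP.melspectrogram(AP.load_wav(p)) for p in wav_paths[:20]]
# print(spec_stats(mels))
```

If the observed range sits well inside the configured min/max, the defaults are probably fine; values clipping at the bounds are the thing to watch for.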


Thanks for the help! Modifying the notebook worked quite well once I knew what to do. I’m currently retraining the model because it broke after getting to r=1. It didn’t recover from > Decoder stopped with ‘max_decoder_steps’ and broke even on short sentences. I also cleaned up the dataset, since it had some samples where the speaker would laugh -> Text (laught).

Regarding the audio parameters, I used the notebook to evaluate some of the values, but I would say the default values are good. Then again, I’m not 100% sure what to pay attention to.
Averaged over mel = AP.melspectrogram(wav), I get mel_min ~ 0.0000 and mel_max ~ 0.7900.

Another Update:

As said, I’m re-training the model. What I did now is:

  • removed r=1 from the gradual training
  • at 230k steps changed the memory_size from -1 to 5

And with forward_attention enabled during inference the attention doesn’t break even on long sentences.

Now I’m thinking about switching to the BN prenet at around ~400k steps, based on your comment here. Does that make sense?

I think once I’m done with this run, I’ll go back and try r=1 one more time.

PS: Some more examples here. Quite happy with the results.


Did you get a sense that changing memory_size to 5 made a significant difference?

From the look of the lines on the charts they seem fairly similar before and after, but maybe the change helped with qualities not apparent in the charts alone.

I’m in a similar position with my training and am on the brink of attempting the 400k switch to finish off the model on BN. It’s still running right now but based on the charts it actually seems to be doing okay on r=1 (touch wood!), which had never happened with previous runs with earlier versions of the code. Thanks for posting your progress - I’ll be doing the same once I’ve tried the BN switch.


I would say the only real difference I noticed was that it stopped producing > Decoder stopped with ‘max_decoder_steps’. But whether that correlates with the memory_size change or with my cleaning up the dataset, I cannot tell.

I’ve now made the switch to BN at ~410k steps; let’s see how it performs.

I’ll try the same for r=1 after the current run and post the results for comparison.


You should switch to BN only after you are sure about the attention performance of the model.

But as far as I can see, attention looks to be working fine.

Sooo long time no hear…

As previously said, I was about to train the model using BN. It didn’t work.
I tried the switch to r=1. That didn’t work either. :sweat_smile:

I cleaned up the dataset and retrained the model. The best result so far is still at r=2.

I trained the new MelGAN vocoder to 600k steps, using the mels from the multi-speaker model.

Here are some samples
and some more
PS: All samples use the same style_wav during inference.


On another note… I also tried training WaveRNN, but I get the following error:

  File "/home/alexander/anaconda3/lib/python3.7/site-packages/torch-1.4.0-py3.7-linux-x86_64.egg/torch/utils/data/_utils/", line 47, in fetch
    return self.collate_fn(data)
  File "/media/alexander/LinuxFS/Documents/PycharmProjects/TTS/WaveRNN/", line 74, in collate
    coarse = np.stack(coarse).astype(np.float32)
  File "<__array_function__ internals>", line 6, in stack
  File "/home/alexander/anaconda3/lib/python3.7/site-packages/numpy/core/", line 425, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
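For context on that traceback: `np.stack` requires every element of the batch to have the same shape, so the error means the collate function received ragged audio segments, most likely from clips shorter than the training segment length. A hedged sketch of a collate that pads short segments before stacking (the variable names are illustrative, not the repo's exact ones):

```python
# Illustrative fix sketch: pad each 1-D audio segment up to seq_len
# before stacking, so np.stack sees uniform shapes.
import numpy as np

def collate_fixed(batch, seq_len):
    """Pad (or crop) each 1-D segment to exactly seq_len, then stack."""
    padded = []
    for coarse in batch:
        coarse = np.asarray(coarse, dtype=np.float32)
        if len(coarse) < seq_len:
            # zero-pad clips that are shorter than the training window
            coarse = np.pad(coarse, (0, seq_len - len(coarse)), mode="constant")
        padded.append(coarse[:seq_len])
    return np.stack(padded).astype(np.float32)
```

Alternatively, filtering out clips shorter than the segment length during preprocessing avoids the ragged batch in the first place.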

This user on GitHub had the same issue.

Any tips?

PPS: As in Contributing my german voice for tts, I also noticed that the model has some problems with umlauts. It will pronounce o as ö and vice versa. Would graphemes be the better choice here?

Best regards.


@sanjaesc Have you solved the ValueError: all input arrays must have the same shape problem? @erogol

Not really… but training on bits seems to work.

Thanks @sanjaesc. I tried training on 10 bits yesterday too; hope that works. Would you mind sharing your log?

I’m currently experimenting with pwgan and melgan. Afterwards I’ll retrain the TTS model using graphemes and give WaveRNN a try.

Thanks @sanjaesc. I have also tried pwgan, based on erogol’s project. May I ask which MelGAN repo you tried? I tried this melgan repo, but the hop size could not be changed to 256, which is what corresponds to Mozilla TTS.

MelGAN is also included in the pwgan repo ^^.
Just use one of the melgan.yaml configs.

Oh, I see. Really appreciate it. Thanks a lot.