Train Multispeaker Dataset + WaveRNN

Thanks for the help! Modifying the notebook worked quite well once I knew what to do. I’m currently retraining the model because it broke after getting to r=1: it didn’t recover from “Decoder stopped with ‘max_decoder_steps’” and failed even on short sentences. I also cleaned up the dataset, since it had some samples where the speaker would laugh, transcribed in the text as “(laugh)”.

Regarding the audio parameters, I used the notebook to evaluate some of the values, but I would say the default values are good. Then again, I’m not 100% sure what to pay attention to.
On average over my data, “mel = AP.melspectrogram(wav)” gives mel_min ~ 0.0000 and mel_max ~ 0.7900.
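
In case anyone wants to reproduce that check, here is roughly what I ran (a sketch assuming Mozilla TTS’s AudioProcessor; import paths and config keys may differ between versions):

    import glob
    import numpy as np
    from TTS.utils.audio import AudioProcessor
    from TTS.utils.generic_utils import load_config

    # Load the same audio settings the model trains with.
    c = load_config("config.json")
    ap = AudioProcessor(**c.audio)

    mins, maxs = [], []
    for path in glob.glob("dataset/wavs/*.wav"):
        wav = ap.load_wav(path)
        mel = ap.melspectrogram(wav)
        mins.append(mel.min())
        maxs.append(mel.max())

    print("average mel_min:", np.mean(mins))  # ~0.0000 on my data
    print("average mel_max:", np.mean(maxs))  # ~0.7900 on my data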

Another Update:

As said, I’m re-training the model. What I did this time:

  • removed r=1 from the gradual training
  • at 230k steps changed the memory_size from -1 to 5

And with forward_attention enabled during inference, the attention doesn’t break even on long sentences.

Now I’m thinking about switching to the BN prenet at around ~400k steps, based on your comment here. Does that make sense?
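
For context, the relevant part of my config.json now looks roughly like this (my values, not recommendations; key names may vary between Mozilla TTS versions). The final r=1 stage has been removed from gradual_training, memory_size is the value I switched to at 230k, prenet_type shows what the BN switch would look like, and use_forward_attn is the inference toggle:

    "gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32]],
    "memory_size": 5,
    "prenet_type": "bn",
    "use_forward_attn": true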

I think once I’m done with this run, I’ll go back and try r=1 one more time.

PS: Some more examples here. Quite happy with the results.

2 Likes

Did you get a sense that changing memory_size to 5 made a significant difference?

From the look of the curves on the charts, they seem fairly similar before and after, but maybe the change helped with qualities not apparent in the charts alone.

I’m in a similar position with my training and am on the brink of attempting the 400k switch to finish off the model on BN. It’s still running right now, but based on the charts it actually seems to be doing okay on r=1 (touch wood!), which had never happened in previous runs with earlier versions of the code. Thanks for posting your progress - I’ll be doing the same once I’ve tried the BN switch.

1 Like

I would say the only real difference I noticed is that it stopped producing “Decoder stopped with ‘max_decoder_steps’”. But whether that correlates with the memory_size change or with me cleaning up the dataset, I cannot tell.

I’ve now made the switch to BN at ~410k steps; let’s see how it performs.

I’ll try the same for r=1 after the current run and post the results for comparison.

1 Like

You should only go with BN once you are sure about the attention performance of the model.

But as far as I see, attention looks to be working fine.

Sooo long time no hear…

As previously said, I was about to train the model using BN. It didn’t work.
Tried the switch to r=1. Didn’t work. :sweat_smile:

Cleaned up the dataset and retrained the model. Best so far is still at r=2.

Trained the new MelGAN vocoder to 600k steps, using the mels from the multi-speaker model.

Here are some samples
https://soundcloud.com/sanjaesc-395770686/sets/multi-speaker-mozilla-tts-melgan-vocoder
and some more
https://soundcloud.com/sanjaesc-395770686/sets/multi-speaker-mozilla-tts-melgan-vocoder-2
PS: All samples use the same style_wav during inference.
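
For reference, conditioning inference on one fixed reference clip looks roughly like this with Mozilla TTS’s synthesis helper (a sketch only; the speaker id and wav path are placeholders, and argument names may differ between versions):

    # `model`, `CONFIG`, `use_cuda` and `ap` come from the usual
    # checkpoint-loading code; speaker_id=3 and the wav are placeholders.
    from TTS.utils.synthesis import synthesis

    outputs = synthesis(model, "Ein Beispielsatz.", CONFIG, use_cuda, ap,
                        speaker_id=3, style_wav="reference_style.wav")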

++++++++++++++++

On another note… I also tried training WaveRNN, but keep getting:

  File "/home/alexander/anaconda3/lib/python3.7/site-packages/torch-1.4.0-py3.7-linux-x86_64.egg/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/media/alexander/LinuxFS/Documents/PycharmProjects/TTS/WaveRNN/dataset.py", line 74, in collate
    coarse = np.stack(coarse).astype(np.float32)
  File "<__array_function__ internals>", line 6, in stack
  File "/home/alexander/anaconda3/lib/python3.7/site-packages/numpy/core/shape_base.py", line 425, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
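
One sanity check I ran: the collate error means the per-item “coarse” segments come out with different lengths before np.stack, which can happen when clips are shorter than one training segment. A quick scan for such clips (a sketch; hop_length and frames_per_segment are placeholders, take them from your WaveRNN config):

    import glob
    import librosa

    # Placeholder values: take these from your WaveRNN config.
    hop_length = 256
    frames_per_segment = 5
    seq_len = hop_length * frames_per_segment

    # Flag clips too short to yield a full training segment (with a
    # small margin for padding).
    too_short = [
        p for p in glob.glob("dataset/wavs/*.wav")
        if len(librosa.load(p, sr=None)[0]) < seq_len + 2 * hop_length
    ]
    print(len(too_short), "clips shorter than one training segment")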

This user on GitHub had the same issue.

Any tips?

PPS: As in Contributing my german voice for tts, I also noticed that the model has some problems with umlauts: it will pronounce o as ö and vice versa. Are graphemes the better choice here?
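
One way to see where the o/ö confusion creeps in would be to compare what the phonemizer produces for minimal pairs (a sketch assuming the espeak backend that Mozilla TTS uses; if the outputs come out identical for such pairs, the problem is upstream of the model):

    # Compare espeak's German phoneme output for umlaut minimal pairs.
    from phonemizer import phonemize

    for word in ["schon", "schön", "Ofen", "Öfen"]:
        print(word, "->", phonemize(word, language="de", backend="espeak"))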

Best regards.

2 Likes

@sanjaesc Have you solved the “ValueError: all input arrays must have the same shape” problem? @erogol

Not really… but training on bits seems to work.

@sanjaesc Thanks. I tried training on 10 bits yesterday too; hope that works. Would you mind sharing your log?

I’m currently experimenting with pwgan and MelGAN. Afterwards I’ll retrain the TTS model using graphemes and give WaveRNN a try.

Thanks @sanjaesc. I have also tried pwgan, based on erogol’s project. May I ask which melgan repo you tried? I tried this melgan repo, but the hop size could not be changed to 256, which is what corresponds to Mozilla TTS.

MelGAN is also included in the pwgan repo ^^.
Just use one of the melgan.yaml configs.
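
Regarding the hop size: the audio parameters in those yaml configs can be set to match Mozilla TTS. Roughly like this (illustrative excerpt; check the key names against the melgan yaml in the version you clone, and keep the rest of the config as shipped):

    sampling_rate: 22050
    fft_size: 1024
    hop_size: 256
    win_length: 1024
    num_mels: 80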

Oh, I see. Really appreciate it. Thanks a lot.

@petertsengruihon in case it’s useful, there’s a bit of detail on using MelGAN here: My latest results using private dataset trained Tacotron2 model with MelGAN vocoder

For me it was pretty straightforward; I just had to make a small adjustment for the slightly smaller memory on my 1080 Ti GPU. It’s worth training longer than the 400k steps I did initially.

@nmstoker Thanks a lot. What you posted is really helpful. It wasn’t until I read your post that I realized pwgan would be a great shot.

1 Like

Is there any pretrained multispeaker model we can get ahold of anywhere? I am running some tests and would like to save some time training.

Here is the model I trained: 10 speakers, German.
Hope it helps.

1 Like

Would it be alright if I put it on our models page?

I’m currently re-training the model without using phonemes, because of the problems with umlauts mentioned above. The results are way better!
I think it would make more sense to upload the model I’m currently training.
Will upload once done training. :grinning:

2 Likes

Thanks for updating. It might be that the default character set does not include the umlaut characters. Have you edited that? Alternatively, you will soon be able to provide custom character sets in config.json on the dev branch.
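
If you end up editing the character set, the config.json block should look roughly like this, with the umlauts added to “characters” (illustrative; the “phonemes” entry is elided here and the exact schema may differ once it lands on dev):

    "characters": {
        "pad": "_",
        "eos": "~",
        "bos": "^",
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÄÖÜäöüß!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "..."
    }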