For me it was pretty straightforward; I just had to make a small adjustment for the slightly smaller memory of my 1080 Ti GPU. It's worth training longer than the 400k steps I did initially.
I'm currently retraining the model without phonemes because of the aforementioned problems with umlauts. The results are way better!
I think it would make more sense to upload the model I’m currently training.
Will upload once done training.
Thanks for the update. It might be that the default character set does not include the umlaut characters. Have you edited that? Soon you'll also be able to set a custom character set in config.json on the dev branch.
config.zip (3.2 KB)
For the config I just used basic_cleaners and, of course, disabled phonemes.
I don't have any abbreviations in my dataset and have already expanded all the numbers, so basic_cleaners is enough.
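For anyone wondering why that is enough: basic_cleaners only lowercases and collapses whitespace and does no ASCII transliteration, so umlauts survive it. Roughly something like this sketch (not the exact Mozilla TTS code, check the cleaners module in your checkout):

```python
import re

_whitespace_re = re.compile(r"\s+")

def basic_cleaners(text):
    """Roughly what basic_cleaners does: lowercase and collapse whitespace.
    There is no ASCII transliteration, so characters like ä/ö/ü are kept
    (unlike e.g. english_cleaners, which folds them to ASCII)."""
    text = text.lower()
    text = _whitespace_re.sub(" ", text)
    return text

print(basic_cleaners("Schöne  Grüße aus   München"))  # -> "schöne grüße aus münchen"
```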
Just wondering: when using a multi-speaker dataset, does it matter if some speakers are not present in the evaluation data?
For example, I have a dataset with 3 speakers with a distribution of
Decided to give a short update on the current status.
Over the past months I have been trying out different configurations of Mozilla TTS.
Trained:
T1 single-speaker / multi-speaker models. (Both models worked quite well.)
T1 single-speaker / multi-speaker models with GST. (Multi-speaker with GST didn't really work.)
T2 single-speaker model. (This felt the most human-like.)
The goal was to train a multi-speaker model with GST support.
So I extended the Tacotron 2 model with support for speaker embeddings and GST, using Nvidia's Mellotron as a guideline.
Instead of summing the embeddings I concatenate them, as in Mellotron.
From my personal point of view this has led to much better results when training a T2 multi-speaker model with GST support.
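To make the concatenation concrete, here is a minimal sketch of the idea; the dimensions and variable names are made up and this is not the actual Mozilla TTS or Mellotron code:

```python
import torch

# Toy shapes, not the real model dimensions.
B, T, D_enc, D_spk, D_gst = 2, 50, 512, 64, 128

encoder_outputs = torch.randn(B, T, D_enc)  # Tacotron 2 encoder output
speaker_emb = torch.randn(B, D_spk)         # learned speaker embedding
gst_emb = torch.randn(B, D_gst)             # global style token embedding

# Broadcast the utterance-level embeddings over the time axis ...
speaker_emb = speaker_emb.unsqueeze(1).expand(-1, T, -1)
gst_emb = gst_emb.unsqueeze(1).expand(-1, T, -1)

# ... and concatenate along the channel dimension instead of summing.
# Summing would force all three tensors to share the same dimensionality
# (or need extra projections); concatenation keeps them separate and
# simply widens the input the decoder attends over.
decoder_inputs = torch.cat([encoder_outputs, speaker_emb, gst_emb], dim=-1)
print(decoder_inputs.shape)  # torch.Size([2, 50, 704])
```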
Currently I'm training a model with 31 speakers, some of which have only 10 minutes of training data. Still, the results are outstanding!
The results sound really good. With a vocoder in place it would be perfect. Do you have a plan to send a PR for that? Also, @edresson1 would be interested to see these results.
@sanjaesc In my experiments I did something very similar, but I used external embeddings; GST was done the same way, following Mellotron as well. In my multi-speaker experiments I got better results with the “original” attention; “Graves” attention didn't sound very good! Did you try Graves attention?
Hey, sorry for the late reply. I didn't really invest more time in WaveRNN, but I think it has something to do with short samples. You could try removing those. Sorry, I can't really help you here.
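If it helps, here is a rough sketch of how the short clips could be dropped from an LJSpeech-style metadata.csv before training; the paths, the layout and the 1 second threshold are just assumptions, not anything from the repo:

```python
import os
import soundfile as sf

# Hypothetical paths/threshold; assumes an LJSpeech-style layout
# (metadata.csv with "id|text|..." lines and a wavs/ folder).
DATASET_DIR = "my_dataset"
MIN_SECONDS = 1.0

kept = []
with open(os.path.join(DATASET_DIR, "metadata.csv"), encoding="utf-8") as f:
    for line in f:
        wav_id = line.split("|")[0]
        wav_path = os.path.join(DATASET_DIR, "wavs", wav_id + ".wav")
        info = sf.info(wav_path)
        duration = info.frames / info.samplerate
        if duration >= MIN_SECONDS:
            kept.append(line)

# Write a filtered metadata file and point the training config at it.
with open(os.path.join(DATASET_DIR, "metadata_filtered.csv"), "w", encoding="utf-8") as f:
    f.writelines(kept)
```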