Train Multispeaker Dataset + WaveRNN

@petertsengruihon in case it’s useful, there’s a bit of detail on using MelGAN here: My latest results using private dataset trained Tacotron2 model with MelGAN vocoder

For me it was pretty straightforward; I just had to make a small adjustment for the slightly smaller memory on my 1080 Ti GPU. It’s worth training longer than the 400k steps I did initially.

@nmstoker Thanks a lot, what you posted is really helpful. Not until I read your post did I realize that PWGAN would be a great option.


Is there any pretrained multispeaker model we can get ahold of anywhere? I am running some tests and would like to save some time training.

Here is the model I trained: 10 speakers, German.
Hope it helps.


Would it be alright if I put it on our models page?

I’m currently re-training the model without using phonemes, because of the aforementioned problems with umlauts. The results are way better!
I think it would make more sense to upload the model I’m currently training.
Will upload once done training. :grinning:


Thanks for the update. It might be that the default character set does not include the umlaut characters. Have you edited that? Alternatively, you will soon be able to provide custom character sets in config.json on the dev branch.

Can you share your config.json (and any modifications to symbols.py etc.)? It would be highly appreciated.


Just add 'äöüßÄÖÜ' to _characters in symbols.py; mine looks like this:

_characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüßÄÖÜ!\'(),-.:;? '

config.zip (3,2 KB)
For the config I just used basic_cleaners and of course disabled phonemes.
I don’t have any abbreviations in my dataset and already expanded all the numbers, so basic_cleaners is enough.
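
In case it helps anyone else, here is a quick, hypothetical sanity check (the metadata.csv path and the LJSpeech-style "wav|text" column layout are assumptions, not from my setup) that the extended _characters set covers every character in your transcripts:

_characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüßÄÖÜ!\'(),-.:;? '

missing = set()
with open('metadata.csv', encoding='utf-8') as f:
    for line in f:
        text = line.strip().split('|')[1]  # assumed "wav|text|..." layout
        missing |= {c for c in text if c not in _characters}

print('characters not covered:', missing or 'none')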


Sounds amazing! Would be really interesting to see/hear how it develops.

Hi,

just wondering, when using a multi-speaker dataset, does it matter if some speakers are not represented in the evaluation data?
For example, I have a dataset with 3 speakers with the following distribution:

speaker 1: 10000 items
speaker 2: 1000 items
speaker 3: 1000 items

By chance, all the evaluation items could be pulled from speaker 1. Does that matter?

The splitting code from generic_utils.py:

from collections import Counter

import numpy as np


def split_dataset(items):
    # The speaker name is the last element of each item.
    speakers = [item[-1] for item in items]
    is_multi_speaker = len(set(speakers)) > 1
    # Use 1% of the data for evaluation, capped at 500 items.
    eval_split_size = 500 if len(items) * 0.01 > 500 else int(
        len(items) * 0.01)
    np.random.seed(0)
    np.random.shuffle(items)
    if is_multi_speaker:
        items_eval = []
        # most stupid code ever -- Fix it !
        while len(items_eval) < eval_split_size:
            speakers = [item[-1] for item in items]
            speaker_counter = Counter(speakers)
            item_idx = np.random.randint(0, len(items))
            # Only move an item to eval if its speaker keeps at least
            # one item in the training set.
            if speaker_counter[items[item_idx][-1]] > 1:
                items_eval.append(items[item_idx])
                del items[item_idx]
        return items_eval, items
    else:
        return items[:eval_split_size], items[eval_split_size:]
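
To see whether that actually happens, a quick check outside of TTS (assuming items is the list the dataset loader already built) can count how many eval items each speaker gets:

from collections import Counter

eval_items, train_items = split_dataset(items)
print(Counter(item[-1] for item in eval_items))  # number of eval items per speaker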

Yes, it matters. How would you check the model’s performance on the other speakers in that case?

Well, yeah, that’s what I was thinking :smiley: It’s just that currently there is a chance that a speaker won’t get any evaluation data.

Something like this should work, I guess? Calculate the eval split for every speaker separately.

if is_multi_speaker:
    speaker_list = set(item[-1] for item in items)

    items_eval = []
    items_train = []
    for speaker in speaker_list:
        # Collect all items belonging to this speaker (exact match, not substring).
        speaker_items = [item for item in items if item[-1] == speaker]
        # 1% of this speaker's items for evaluation, capped at 500.
        eval_split_size = 500 if len(speaker_items) * 0.01 > 500 else int(
            len(speaker_items) * 0.01)
        items_eval.extend(speaker_items[:eval_split_size])
        items_train.extend(speaker_items[eval_split_size:])
    return items_eval, items_train

The current splitting method in TTS should handle that, if there is no bug :slight_smile:

Please send a PR if you see any mistakes.

Hi everyone,

I decided to give a short update on the current status.
Over the past months I have been trying out different configurations of Mozilla TTS.

Trained:

  • T1 single-speaker and multi-speaker models. (Both models worked quite well.)
  • T1 single-speaker and multi-speaker models with GST. (Multi-speaker with GST didn’t really work.)
  • T2 single-speaker model. (This felt the most human-like.)

The goal was to train a multi-speaker model with GST support.
So I extended the Tacotron2 model with support for speaker embeddings and GST, using Nvidia’s Mellotron as a guideline.

Instead of summing the embeddings, I concatenate them (see Mellotron).
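
As a rough illustration (a minimal sketch, not the actual code from my fork; all tensor sizes here are made-up assumptions), the concatenation looks roughly like this:

import torch

B, T, enc_dim = 8, 120, 512    # batch size, encoder time steps, encoder channels (example values)
spk_dim, gst_dim = 64, 128     # assumed speaker / GST embedding sizes

encoder_outputs = torch.randn(B, T, enc_dim)
speaker_embedding = torch.randn(B, spk_dim)
gst_embedding = torch.randn(B, gst_dim)

# Broadcast the per-utterance embeddings over the time axis.
spk = speaker_embedding.unsqueeze(1).expand(-1, T, -1)
gst = gst_embedding.unsqueeze(1).expand(-1, T, -1)

# Summing would require projecting everything to enc_dim first; concatenating
# (as in Mellotron) simply widens the channel dimension the decoder attends over.
decoder_memory = torch.cat([encoder_outputs, spk, gst], dim=-1)  # (B, T, enc_dim + spk_dim + gst_dim)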

From my personal point of view, this has led to much better results when training a T2 multi-speaker model with GST support.

Currently I’m training a model with 31 speakers, some of which have only 10 minutes of training data. Still, the results are outstanding!

Here are some samples:
Soundcloud

My fork: Link


@sanjaesc Nice! Did you ever fix the problem you had above with WaveRNN? I’m running into the same problem with input array shapes.

The results sound really good. With a vocoder in place, that would be perfect. Do you have a plan to send a PR for that? @edresson1 would also be interested to see these results.

@sanjaesc In my experiments I did something very similar, but I used external embeddings; GST I did the same way, following Mellotron too. In my multi-speaker experiments I got better results with “original” attention; “Graves” attention didn’t sound very good! Did you try Graves attention?

Hey, sorry for the late reply. I didn’t really invest more time in WaveRNN, but I think it has something to do with short samples. You can try removing those. Sorry, can’t really help you here :sweat_smile:.
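
If you want to try that, here is a hypothetical filter that drops very short clips before training; the 0.5 s threshold, the metadata.csv format and the wavs/ folder layout are assumptions you would need to adapt:

import os
import soundfile as sf

MIN_SECONDS = 0.5  # assumed threshold; tune for your data

kept = []
with open('metadata.csv', encoding='utf-8') as f:
    for line in f:
        wav_name = line.split('|')[0]  # assumed "wav|text|..." layout
        audio, sr = sf.read(os.path.join('wavs', wav_name + '.wav'))
        if len(audio) / sr >= MIN_SECONDS:
            kept.append(line)

with open('metadata_filtered.csv', 'w', encoding='utf-8') as f:
    f.writelines(kept)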