Train Multispeaker Dataset + WaveRNN

@sanjaesc Thanks, sanjaesc. I tried training with 10 bits yesterday too. Hopefully that works. Would you mind sharing your log?

I’m currently experimenting with pwgan and melgan. Afterwards I’ll retrain the tts model using graphemes and give wavernn a try.

Thanks @sanjaesc. I have also tried pwgan, based on erogol's project. May I ask which MelGAN repo you tried? I tried this melgan repo, but the hop size could not be changed to 256, which is what Mozilla TTS uses.

MelGAN is also included in the pwgan repo ^^.
Just use one of the melgan.yaml configs.

Oh, I see. Really appreciate it. Thanks a lot.

@petertsengruihon in case it’s useful, there’s a bit of detail on using MelGAN here: My latest results using private dataset trained Tacotron2 model with MelGAN vocoder

For me it was pretty straightforward; I just had to make a small adjustment for the slightly smaller memory on my 1080 Ti GPU. It’s worth training longer than the 400k steps I did initially.

@nmstoker Thanks a lot. What you posted is really helpful. It was only after reading your post that I realized pwgan would be a great option.


Is there any pretrained multispeaker model we can get ahold of anywhere? I am running some tests and would like to save some time training.

Here is the model I trained: 10 speakers, German.
Hope it helps.


Would it be alright if I put it on our models page?

I’m currently re-training the model without using phonemes, because of the aforementioned problems with umlauts. The results are way better!
I think it would make more sense to upload the model I’m currently training.
Will upload once done training. :grinning:


Thanks for updating. It might be that the default character set does not include the umlaut characters. Have you edited that? Alternatively, you will soon be able to define custom character sets in config.json on the dev branch.

Can you share your config.json (and any modifications to symbols.py etc.)? It would be highly appreciated.


Just add 'äöüßÄÖÜ' to _characters in symbols.py; mine looks like this:

_characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüßÄÖÜ!\'(),-.:;? '

config.zip (3,2 KB)
For the config I just used the basic_cleaners and of course disabled phonemes.
I don’t have any abbreviations in my dataset and already expanded all the numbers, so basic_cleaners is enough.
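
In other words, the relevant entries in config.json should look roughly like this (just an illustration, assuming the standard Mozilla TTS config keys):

"text_cleaner": "basic_cleaners",
"use_phonemes": false,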


Sounds amazing! Would be really interesting to see / hear how it develops.

Hi,

just wondering, when using a multi-speaker dataset, does it matter if some speakers are not represented in the evaluation data?
For example, I have a dataset with 3 speakers with the following distribution:

speaker 1: 10000 items
speaker 2: 1000 items
speaker 3: 1000 items

By chance, all the evaluation items could end up being pulled from speaker 1. Does that matter?

The code for splitting in generic_utils.py:

def split_dataset(items):
    is_multi_speaker = False
    speakers = [item[-1] for item in items]
    is_multi_speaker = len(set(speakers)) > 1
    eval_split_size = 500 if len(items) * 0.01 > 500 else int(
        len(items) * 0.01)
    np.random.seed(0)
    np.random.shuffle(items)
    if is_multi_speaker:
        items_eval = []
        # most stupid code ever -- Fix it !
        while len(items_eval) < eval_split_size:
            speakers = [item[-1] for item in items]
            speaker_counter = Counter(speakers)
            item_idx = np.random.randint(0, len(items))
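            # move the item to eval only if its speaker still has more than one
            # item left, so each speaker keeps at least one item for training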
            if speaker_counter[items[item_idx][-1]] > 1:
                items_eval.append(items[item_idx])
                del items[item_idx]
        return items_eval, items
    else:
        return items[:eval_split_size], items[eval_split_size:]

Yes, it matters. How would you check the model’s performance on the other speakers in that case?

Well yeah, that’s what I was thinking :smiley: It’s just that currently there is a chance that a speaker won’t get any evaluation data.

Something like this should work, I guess? Calculate the eval split for every speaker separately.

if is_multi_speaker:
    speaker_list = [item[-1] for item in items]
    speaker_list = set(speaker_list)

    items_eval = []
    items_train = []
    for speaker in speaker_list:
        temp_item_list = []
        for item in items:
            # exact match, so e.g. "speaker1" does not also match "speaker10"
            if speaker == item[-1]:
                temp_item_list.append(item)
        eval_split_size = 500 if len(temp_item_list) * 0.01 > 500 else int(
            len(temp_item_list) * 0.01)
        temp_eval = temp_item_list[:eval_split_size]
        temp_train = temp_item_list[eval_split_size:]
        items_eval.extend(temp_eval)
        items_train.extend(temp_train)
    return items_eval, items_train
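
For what it’s worth, a quick sanity check after splitting (just a sketch; it only assumes the speaker label is the last element of each item, as in the code above):

from collections import Counter

items_eval, items_train = split_dataset(items)
print(Counter(item[-1] for item in items_eval))   # eval items per speaker
print(Counter(item[-1] for item in items_train))  # train items per speaker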

The current splitting method in TTS should handle that if there is no bug :slight_smile:

Please send a PR if you see any mistakes.