Train Multispeaker Dataset + WaveRNN

@sanjaesc Thanks, sanjaesc. I tried training with 10 bits yesterday too. Hopefully that works. Would you mind sharing your log?

I’m currently experimenting with pwgan and melgan. Afterwards I’ll retrain the tts model using graphemes and give wavernn a try.

Thanks @sanjaesc. I have also tried pwgan, based on erogol's project. May I ask which MelGAN repo you tried? I tried this melgan repo, but the hop size could not be changed to 256, which is what Mozilla TTS uses.

MelGAN is also included in the pwgan repo ^^.
Just use one of the melgan.yaml configs.

Oh, I see. Really appreciate it. Thanks a lot.

@petertsengruihon in case it’s useful, there’s a bit of detail on using MelGAN here: My latest results using private dataset trained Tacotron2 model with MelGAN vocoder

For me it was pretty straightforward; I just had to make a small adjustment for the slightly smaller memory on my 1080 Ti GPU. It’s worth training longer than the 400k steps I did initially.

@nmstoker Thanks a lot. What you posted is really helpful. It was only after reading your post that I realized pwgan would be a great option.


Is there any pretrained multispeaker model we can get ahold of anywhere? I am running some tests and would like to save some time training.

Here is the model I trained: 10 speakers, German.
Hope it helps.


Would it be alright if I put it on our models page?

I’m currently re-training the model without using phonemes, because of the aforementioned problems with umlauts. The results are way better!
I think it would make more sense to upload the model I’m currently training.
Will upload once done training. :grinning:


Thanks for updating. It might be that the default character set does not include the umlaut characters. Have you edited that? Alternatively, you will soon be able to define custom character sets in config.json on the dev branch.

Can you share your config.json (and any modifications to symbols.py etc.)? It would be highly appreciated.


Just add 'äöüßÄÖÜ' to _characters in symbols.py; mine looks like this:

_characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüßÄÖÜ!\'(),-.:;? '

config.zip (3,2 KB)
For the config I just used the basic_cleaners and of course disabled phonemes.
I don’t have any abbreviations in my dataset and already expanded all the numbers, so basic_cleaners is enough.
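
In other words, the relevant entries in config.json should look roughly like this (just an illustration, assuming the standard Mozilla TTS config keys):

"text_cleaner": "basic_cleaners",
"use_phonemes": false,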


Sounds amazing! Would be really interesting to see / hear how it develops.

Hi,

just wondering, when using a multi-speaker dataset, does it matter if some speakers are not represented in the evaluation data?
For example, I have a dataset with 3 speakers with the following distribution:

speaker 1: 10000 items
speaker 2: 1000 items
speaker 3: 1000 items

By chance, all the evaluation items could end up being pulled from speaker 1. Does that matter?

The code for splitting in generic_utils.py:

def split_dataset(items):
    is_multi_speaker = False
    speakers = [item[-1] for item in items]
    is_multi_speaker = len(set(speakers)) > 1
    eval_split_size = 500 if len(items) * 0.01 > 500 else int(
        len(items) * 0.01)
    np.random.seed(0)
    np.random.shuffle(items)
    if is_multi_speaker:
        items_eval = []
        # most stupid code ever -- Fix it !
        while len(items_eval) < eval_split_size:
            speakers = [item[-1] for item in items]
            speaker_counter = Counter(speakers)
            item_idx = np.random.randint(0, len(items))
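            # move the item to eval only if its speaker still has more than one
            # item left, so each speaker keeps at least one item for training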
            if speaker_counter[items[item_idx][-1]] > 1:
                items_eval.append(items[item_idx])
                del items[item_idx]
        return items_eval, items
    else:
        return items[:eval_split_size], items[eval_split_size:]

Yes, it matters. How would you check the model’s performance on the other speakers in that case?

Well yeah, that’s what I was thinking :smiley: It’s just that currently there is a chance that a speaker won’t get any evaluation data.

Something like this should work, I guess? Calculate the eval split for every speaker separately.

if is_multi_speaker:
    speaker_list = [item[-1] for item in items]
    speaker_list = set(speaker_list)

    items_eval = []
    items_train = []
    for speaker in speaker_list:
        temp_item_list = []
        for item in items:
            # exact match, so e.g. "speaker1" does not also match "speaker10"
            if speaker == item[-1]:
                temp_item_list.append(item)
        eval_split_size = 500 if len(temp_item_list) * 0.01 > 500 else int(
            len(temp_item_list) * 0.01)
        temp_eval = temp_item_list[:eval_split_size]
        temp_train = temp_item_list[eval_split_size:]
        items_eval.extend(temp_eval)
        items_train.extend(temp_train)
    return items_eval, items_train
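
For what it’s worth, a quick sanity check after splitting (just a sketch; it only assumes the speaker label is the last element of each item, as in the code above):

from collections import Counter

items_eval, items_train = split_dataset(items)
print(Counter(item[-1] for item in items_eval))   # eval items per speaker
print(Counter(item[-1] for item in items_train))  # train items per speaker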

The current splitting method in TTS should handle that if there is no bug :slight_smile:

Please send a PR if you see any mistakes.