For me it was pretty straightforward; I just had to make a small adjustment for the slightly smaller memory of my 1080 Ti GPU. It's worth training longer than the 400k steps I did initially.
I'm currently retraining the model without phonemes because of the aforementioned problems with umlauts. The results are way better!
I think it would make more sense to upload the model I’m currently training.
Will upload once done training.
Thanks for the update. It might be that the default character set does not include the umlaut characters. Have you edited that? Soon you'll also be able to set a custom character set in config.json on the dev branch.
config.zip (3.2 KB)
For the config I just used basic_cleaners and, of course, disabled phonemes.
I don't have any abbreviations in my dataset and have already expanded all the numbers, so basic_cleaners is enough.
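For anyone wondering why that is enough: basic_cleaners only lowercases and collapses whitespace and does no ASCII transliteration, so umlauts survive it. Roughly something like this sketch (not the exact Mozilla TTS code, check the cleaners module in your checkout):

```python
import re

_whitespace_re = re.compile(r"\s+")

def basic_cleaners(text):
    """Roughly what basic_cleaners does: lowercase and collapse whitespace.
    There is no ASCII transliteration, so characters like ä/ö/ü are kept
    (unlike e.g. english_cleaners, which folds them to ASCII)."""
    text = text.lower()
    text = _whitespace_re.sub(" ", text)
    return text

print(basic_cleaners("Schöne  Grüße aus   München"))  # -> "schöne grüße aus münchen"
```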
Just wondering: when using a multi-speaker dataset, does it matter if some speakers are not present in the evaluation data?
For example, I have a dataset with 3 speakers with a distribution of
Decided to give a short update on the current status.
Over the past months I have been trying out different configurations of Mozilla TTS.
Trained:
T1 single-speaker / multi-speaker models. (Both models worked quite well.)
T1 single-speaker / multi-speaker models with GST. (Multi-speaker with GST didn't really work.)
T2 single-speaker model. (This felt the most human-like.)
The goal was to train a multi-speaker model with GST support.
So I extended the Tacotron 2 model with support for speaker embeddings and GST, using Nvidia's Mellotron as a guideline.
Instead of summing the embeddings I concatenate them, as in Mellotron.
From my personal point of view this has led to much better results when training a T2 multi-speaker model with GST support.
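To make the concatenation concrete, here is a minimal sketch of the idea; the dimensions and variable names are made up and this is not the actual Mozilla TTS or Mellotron code:

```python
import torch

# Toy shapes, not the real model dimensions.
B, T, D_enc, D_spk, D_gst = 2, 50, 512, 64, 128

encoder_outputs = torch.randn(B, T, D_enc)  # Tacotron 2 encoder output
speaker_emb = torch.randn(B, D_spk)         # learned speaker embedding
gst_emb = torch.randn(B, D_gst)             # global style token embedding

# Broadcast the utterance-level embeddings over the time axis ...
speaker_emb = speaker_emb.unsqueeze(1).expand(-1, T, -1)
gst_emb = gst_emb.unsqueeze(1).expand(-1, T, -1)

# ... and concatenate along the channel dimension instead of summing.
# Summing would force all three tensors to share the same dimensionality
# (or need extra projections); concatenation keeps them separate and
# simply widens the input the decoder attends over.
decoder_inputs = torch.cat([encoder_outputs, speaker_emb, gst_emb], dim=-1)
print(decoder_inputs.shape)  # torch.Size([2, 50, 704])
```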
Currently I'm training a model with 31 speakers, some of which have only 10 minutes of training data. Still, the results are outstanding!
The results sound really good. With a vocoder in place it would be perfect. Do you have a plan to send a PR for that? Also, @edresson1 would be interested to see these results.
@sanjaesc In my experiments I did something very similar, but I used external embeddings; GST was done the same way, following Mellotron as well. In my multi-speaker experiments I got better results with the “original” attention; “Graves” attention didn't sound very good! Did you try Graves attention?
Hey, sorry for the late reply. I didn't really invest more time in WaveRNN, but I think it has something to do with short samples. You could try removing those. Sorry, I can't really help you here.
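If it helps, here is a rough sketch of how the short clips could be dropped from an LJSpeech-style metadata.csv before training; the paths, the layout and the 1 second threshold are just assumptions, not anything from the repo:

```python
import os
import soundfile as sf

# Hypothetical paths/threshold; assumes an LJSpeech-style layout
# (metadata.csv with "id|text|..." lines and a wavs/ folder).
DATASET_DIR = "my_dataset"
MIN_SECONDS = 1.0

kept = []
with open(os.path.join(DATASET_DIR, "metadata.csv"), encoding="utf-8") as f:
    for line in f:
        wav_id = line.split("|")[0]
        wav_path = os.path.join(DATASET_DIR, "wavs", wav_id + ".wav")
        info = sf.info(wav_path)
        duration = info.frames / info.samplerate
        if duration >= MIN_SECONDS:
            kept.append(line)

# Write a filtered metadata file and point the training config at it.
with open(os.path.join(DATASET_DIR, "metadata_filtered.csv"), "w", encoding="utf-8") as f:
    f.writelines(kept)
```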