Train Multispeaker Dataset + WaveRNN

Hi,

Just wondering, does it matter when using a multi-speaker dataset if some speakers are not represented in the evaluation data?
For example, I have a dataset with 3 speakers with the following distribution:

speaker 1: 10000 items
speaker 2: 1000 items
speaker 3: 1000 items

By chance, all the evaluation items might be pulled from speaker 1. Does it matter?

The code for splitting in generic_utils.py:

from collections import Counter

import numpy as np


def split_dataset(items):
    is_multi_speaker = False
    speakers = [item[-1] for item in items]
    is_multi_speaker = len(set(speakers)) > 1
    # use at most 500 items (or 1% of the dataset) for evaluation
    eval_split_size = 500 if len(items) * 0.01 > 500 else int(
        len(items) * 0.01)
    np.random.seed(0)
    np.random.shuffle(items)
    if is_multi_speaker:
        items_eval = []
        # most stupid code ever -- Fix it !
        while len(items_eval) < eval_split_size:
            speakers = [item[-1] for item in items]
            speaker_counter = Counter(speakers)
            item_idx = np.random.randint(0, len(items))
            # only move an item to eval if its speaker keeps at least one training item
            if speaker_counter[items[item_idx][-1]] > 1:
                items_eval.append(items[item_idx])
                del items[item_idx]
        return items_eval, items
    else:
        return items[:eval_split_size], items[eval_split_size:]
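
For a quick check of whether each speaker actually ends up in the eval split, something like this could be used (my own addition, assuming item[-1] holds the speaker name as above):

from collections import Counter

items_eval, items_train = split_dataset(items)
# count how many eval/train items each speaker got
print(Counter(item[-1] for item in items_eval))
print(Counter(item[-1] for item in items_train))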

Yes, it matters. How would you check the model's performance on the other speakers in that case?

Well yeah, that's what I was thinking :smiley: It's just that currently there is a chance that a speaker won't get any evaluation data.

Something like this should work, I guess? Calculate the eval split for every speaker separately.

if is_multi_speaker:
    speaker_list = set(item[-1] for item in items)

    items_eval = []
    items_train = []
    for speaker in speaker_list:
        # collect all items belonging to this speaker
        temp_item_list = [item for item in items if item[-1] == speaker]
        # use at most 500 items (or 1% of this speaker's data) for evaluation
        eval_split_size = 500 if len(temp_item_list) * 0.01 > 500 else int(
            len(temp_item_list) * 0.01)
        temp_eval = temp_item_list[:eval_split_size]
        temp_train = temp_item_list[eval_split_size:]
        items_eval.extend(temp_eval)
        items_train.extend(temp_train)
    return items_eval, items_train
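
For what it's worth, a slightly more compact sketch of the same per-speaker split (my own suggestion, assuming as above that item[-1] holds the speaker name and that items were already shuffled beforehand):

from collections import defaultdict


def split_dataset_per_speaker(items, max_eval_per_speaker=500, eval_ratio=0.01):
    # group items by speaker name
    by_speaker = defaultdict(list)
    for item in items:
        by_speaker[item[-1]].append(item)

    items_eval, items_train = [], []
    for speaker_items in by_speaker.values():
        # at most 500 items (or 1% of this speaker's data) go to eval
        n_eval = min(max_eval_per_speaker, int(len(speaker_items) * eval_ratio))
        items_eval.extend(speaker_items[:n_eval])
        items_train.extend(speaker_items[n_eval:])
    return items_eval, items_train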

The current splitting method in TTS should handle that if there is no bug :slight_smile:

Please send a PR if you see any mistakes.

Hi everyone,

I decided to give a short update on the current status.
In the past months I have been trying out different configurations of Mozilla TTS.

Trained:

  • T1 single-speaker/multi-speaker models. (Both models worked quite well.)
  • T1 single-speaker/multi-speaker models with GST. (Multi-speaker with GST didn't really work.)
  • T2 single-speaker model. (This felt the most human-like.)

The goal was to train a multi-speaker model with GST support.
So I extended the Tacotron2 model with support for speaker embeddings and GST, using Mellotron from Nvidia as a guideline.

Instead of summing the embeddings, I concatenate them (ref. Mellotron).

From my personal point of view, this has led to much better results when training a T2 multi-speaker model with GST support.
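
To make the idea concrete, here is a minimal sketch (my own simplified illustration, not the exact code from my fork): the speaker embedding and the GST embedding are broadcast over time and concatenated to the encoder outputs instead of being added to them.

import torch


def condition_encoder_outputs(encoder_outputs, speaker_emb, gst_emb):
    # encoder_outputs: [B, T, enc_dim]
    # speaker_emb:     [B, spk_dim]
    # gst_emb:         [B, gst_dim]
    T = encoder_outputs.size(1)
    spk = speaker_emb.unsqueeze(1).expand(-1, T, -1)
    gst = gst_emb.unsqueeze(1).expand(-1, T, -1)
    # concatenate along the feature axis -> [B, T, enc_dim + spk_dim + gst_dim],
    # so the decoder attends over the wider conditioned representation
    return torch.cat([encoder_outputs, spk, gst], dim=-1)

The decoder input size has to grow accordingly, whereas summing would require all embeddings to match the encoder dimension.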

Currently I’m training a model with 31 speakers, some of which have only 10 minutes of training data. Still, the results are outstanding!

Here are some samples:
Soundcloud

My fork: Link

@sanjaesc Nice! Did you ever fix the problem you had above with WaveRNN? I'm running into the same problem with input array shapes.

Results sound really good. With a vocoder intact, that would be perfect. Do you have a plan to send a PR on that? Also, @edresson1 would be interested to see these results.

@sanjaesc In my experiments, I did something very similar, but I used external embeddings; I did GST the same way, following Mellotron too. In my multi-speaker experiments I got better results with the “original” attention; “Graves” attention didn’t sound very good! Did you try Graves attention?

Hey, sorry for the late reply. I didn't really invest more time in WaveRNN, but I think it has something to do with short samples. You can try removing those. Sorry, I can't really help you further here :sweat_smile:.

Results sound really good. With a vocoder intact, that would be perfect.

I’m trying to train a multispeaker vocoder (with your newest implementation from the dev branch), currently at ~400k steps. It takes time ^^.

Do you have a plan to send a PR on that?

Can do that.

Yeah, I tried Graves attention, with and without bidirectional decoding. I had the same experience as you; the results sounded worse. In my experience, I have had the best results so far with the original attention and GST.

Currently I’m experimenting with the new DDC feature from the dev branch.

Here are the first results of the multi-speaker vocoder at 700k steps. These samples were generated with the commit from the current master branch and the GST modifications I made.

https://soundcloud.com/sanjaesc-395770686/sets/gothic-tts-tacotron-2

Some sound better than others.
The two female voices sound rather bad.

These are the audio parameters I used. Any tips here?

// AUDIO PARAMETERS
"audio":{
    "num_freq": 1025,        // number of stft frequency levels. Size of the linear spectrogram frame.
    "win_length": 1024,      // stft window length in samples.
    "hop_length": 256,       // stft window hop-length in samples.
    "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
    "frame_shift_ms": null,  // stft window hop-length in ms. If null, 'hop_length' is used.

    // Audio processing parameters
    "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different from the original data, it is resampled.
    "preemphasis": 0.98,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
    "ref_level_db": 20,     // reference level db, theoretically 20 db is the sound of air.

    // Silence trimming
    "do_trim_silence": true,// enable trimming of silence as you load the audio. LJSpeech (false), TWEB (false), Nancy (true)
    "trim_db": 60,          // threshold for trimming silence. Set this according to your dataset.

    // MelSpectrogram parameters
    "num_mels": 80,         // size of the mel spec frame.
    "mel_fmin": 40.0,       // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
    "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!
    "spec_gain": 20.0,      // scaler value applied after log transform of spectrogram.

    // Normalization parameters
    "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined, otherwise range normalization defined by the other params.
    "min_level_db": -100,   // lower bound for normalization
    "symmetric_norm": true, // move normalization to range [-1, 1]
    "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
    "clip_norm": true,      // clip normalized values into the range.
    "stats_path": null      // DO NOT USE WITH MULTI_SPEAKER MODEL. Scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored.
},

Are num_mels, mel_fmin and mel_fmax the same on both the TTS and vocoder models? The other day I trained a female TTS model with a mel_fmin of 80 while my vocoder had a mel_fmin of 0, and it did not work.
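
A quick way to catch such mismatches is a small check script like the following (my own sketch with hypothetical config paths; the // comments in the configs are stripped before parsing):

import json
import re


def load_commented_json(path):
    # the TTS configs contain // comments, which plain JSON does not allow
    with open(path) as f:
        text = re.sub(r"//.*", "", f.read())
    return json.loads(text)


tts_audio = load_commented_json("tts_config.json")["audio"]      # hypothetical path
voc_audio = load_commented_json("vocoder_config.json")["audio"]  # hypothetical path

for key in ("sample_rate", "num_mels", "mel_fmin", "mel_fmax", "hop_length", "win_length"):
    assert tts_audio[key] == voc_audio[key], (
        f"audio mismatch for {key}: {tts_audio[key]} vs {voc_audio[key]}")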

Are you intending to release this vocoder? It sounds so good! I tried to train PWGan on LibriTTS, but unfortunately the results were very bad with a lot of static.

Yeah, the TTS and vocoder audio settings are the same. Sure, I will upload the TTS model and vocoder once training is finished. But the vocoder is only trained on data from my dataset, so I don’t think it will work well with other models.

This is awesome! I love Gothic 2 and I think it’s one of the best games ever, specifically because of the speech. Thanks for sharing.

Question!

Did anyone try to train a multi-speaker model with the newly integrated vocoder module?
I’ve noticed that spec_gain is set to 1 in the config. If I do that, the model doesn’t seem to learn once the discriminator kicks in at 200k steps… it’s just gibberish from there on. With spec_gain set to 20 it works, so do I train with 20 and set it to 1 during inference?
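
For context, spec_gain is the scaler applied after the log transform of the spectrogram (see the audio config above), so whichever value the model was trained with also has to be used when decoding at inference time. A rough illustration (my own sketch, not the exact AudioProcessor code):

import numpy as np


def amp_to_db(x, spec_gain):
    # log-compress the spectrogram and scale it by spec_gain
    return spec_gain * np.log10(np.maximum(1e-5, x))


def db_to_amp(x, spec_gain):
    # inverse of amp_to_db for the same spec_gain
    return np.power(10.0, x / spec_gain)


mel = np.random.rand(80, 100)
# the round-trip only works if the same gain is used in both directions
print(np.allclose(mel, db_to_amp(amp_to_db(mel, 20.0), 20.0), atol=1e-4))  # True
print(np.allclose(mel, db_to_amp(amp_to_db(mel, 20.0), 1.0), atol=1e-4))   # False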

Currently I am experimenting with the pwgan implementation, it’s at 600k steps. The results are not great so far.

Also, would it make more sense to train on just the data I used during the TTS training? Right now I also included some other speakers from the dataset I am working with, which were not used during TTS training.

How much speech do you have for each speaker? All the PWGan experiments I tried on LibriTTS were very bad: a lot of noise and static. 6 speakers with 10 hours each are giving me much better results and the same level of generalization.

It varies quite a lot, from a few hours per speaker down to a few minutes. The audio quality is good overall.

I will see if training only on the data used during TTS training improves the quality in any way.

Did you try the native vocoder implementation?

Really impressive.

The Google Drive link to the models above does not work anymore. Do you plan to provide the newest models and notebooks to run and test them? Thank you.