Train Multispeaker Dataset + WaveRNN

Hi,

Just wondering, does it matter when using a multi-speaker dataset if some speakers are not represented in the evaluation data?
For example, I have a dataset with 3 speakers with the following distribution:

speaker 1: 10000 items
speaker 2: 1000 items
speaker 3: 1000 items

By chance, all the evaluation items might be pulled from speaker 1. Does it matter?

The code for splitting in generic_utils.py:

from collections import Counter

import numpy as np


def split_dataset(items):
    is_multi_speaker = False
    speakers = [item[-1] for item in items]
    is_multi_speaker = len(set(speakers)) > 1
    # use at most 500 items (or 1% of the dataset) for evaluation
    eval_split_size = 500 if len(items) * 0.01 > 500 else int(
        len(items) * 0.01)
    np.random.seed(0)
    np.random.shuffle(items)
    if is_multi_speaker:
        items_eval = []
        # most stupid code ever -- Fix it !
        while len(items_eval) < eval_split_size:
            speakers = [item[-1] for item in items]
            speaker_counter = Counter(speakers)
            item_idx = np.random.randint(0, len(items))
            # only move an item to eval if its speaker keeps at least one training item
            if speaker_counter[items[item_idx][-1]] > 1:
                items_eval.append(items[item_idx])
                del items[item_idx]
        return items_eval, items
    else:
        return items[:eval_split_size], items[eval_split_size:]
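
For a quick check of whether each speaker actually ends up in the eval split, something like this could be used (my own addition, assuming item[-1] holds the speaker name as above):

from collections import Counter

items_eval, items_train = split_dataset(items)
# count how many eval/train items each speaker got
print(Counter(item[-1] for item in items_eval))
print(Counter(item[-1] for item in items_train))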

Yes, it matters. How would you check the model's performance on the other speakers in that case?

Well yeah, that's what I was thinking :smiley: It's just that currently there is a chance that a speaker won't get any evaluation data.

Something like this should work, I guess? Calculate the eval split for every speaker separately.

if is_multi_speaker:
    speaker_list = set(item[-1] for item in items)

    items_eval = []
    items_train = []
    for speaker in speaker_list:
        # collect all items belonging to this speaker
        temp_item_list = [item for item in items if item[-1] == speaker]
        # use at most 500 items (or 1% of this speaker's data) for evaluation
        eval_split_size = 500 if len(temp_item_list) * 0.01 > 500 else int(
            len(temp_item_list) * 0.01)
        temp_eval = temp_item_list[:eval_split_size]
        temp_train = temp_item_list[eval_split_size:]
        items_eval.extend(temp_eval)
        items_train.extend(temp_train)
    return items_eval, items_train
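
For what it's worth, a slightly more compact sketch of the same per-speaker split (my own suggestion, assuming as above that item[-1] holds the speaker name and that items were already shuffled beforehand):

from collections import defaultdict


def split_dataset_per_speaker(items, max_eval_per_speaker=500, eval_ratio=0.01):
    # group items by speaker name
    by_speaker = defaultdict(list)
    for item in items:
        by_speaker[item[-1]].append(item)

    items_eval, items_train = [], []
    for speaker_items in by_speaker.values():
        # at most 500 items (or 1% of this speaker's data) go to eval
        n_eval = min(max_eval_per_speaker, int(len(speaker_items) * eval_ratio))
        items_eval.extend(speaker_items[:n_eval])
        items_train.extend(speaker_items[n_eval:])
    return items_eval, items_train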

The current splitting method in TTS should handle that if there is no bug :slight_smile:

Please send a PR if you see any mistakes.

Hi everyone,

I decided to give a short update on the current status.
In the past months I have been trying out different configurations of Mozilla TTS.

Trained:

  • T1 single-speaker/multi-speaker models. (Both models worked quite well.)
  • T1 single-speaker/multi-speaker models with GST. (Multi-speaker with GST didn't really work.)
  • T2 single-speaker model. (This felt the most human-like.)

The goal was to train a multi-speaker model with GST support.
So I extended the Tacotron2 model with support for speaker embeddings and GST, using Mellotron from Nvidia as a guideline.

Instead of summing the embeddings, I concatenate them (ref. Mellotron).

From my personal point of view, this has led to much better results when training a T2 multi-speaker model with GST support.
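
To make the idea concrete, here is a minimal sketch (my own simplified illustration, not the exact code from my fork): the speaker embedding and the GST embedding are broadcast over time and concatenated to the encoder outputs instead of being added to them.

import torch


def condition_encoder_outputs(encoder_outputs, speaker_emb, gst_emb):
    # encoder_outputs: [B, T, enc_dim]
    # speaker_emb:     [B, spk_dim]
    # gst_emb:         [B, gst_dim]
    T = encoder_outputs.size(1)
    spk = speaker_emb.unsqueeze(1).expand(-1, T, -1)
    gst = gst_emb.unsqueeze(1).expand(-1, T, -1)
    # concatenate along the feature axis -> [B, T, enc_dim + spk_dim + gst_dim],
    # so the decoder attends over the wider conditioned representation
    return torch.cat([encoder_outputs, spk, gst], dim=-1)

The decoder input size has to grow accordingly, whereas summing would require all embeddings to match the encoder dimension.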

Currently I’m training a model with 31 speakers, some of which have only 10 minutes of training data. Still, the results are outstanding!

Here are some samples:
Soundcloud

My fork: Link

@sanjaesc Nice! Did you ever fix the problem you had above with WaveRNN? I'm running into the same problem with input array shapes.

Results sound really good. With a vocoder intact, that would be perfect. Do you have a plan to send a PR on that? Also, @edresson1 would be interested to see these results.

@sanjaesc In my experiments, I did something very similar, but I used external embeddings; I did GST the same way, following Mellotron too. In my multi-speaker experiments I got better results with the “original” attention; “Graves” attention didn’t sound very good! Did you try Graves attention?

Hey, sorry for the late reply. I didn't really invest more time in WaveRNN, but I think it has something to do with short samples. You can try removing those. Sorry, I can't really help you further here :sweat_smile:.

Results sound really good. With a vocoder intact, that would be perfect.

I’m trying to train a multispeaker vocoder (with your newest implementation from the dev branch), currently at ~400k steps. It takes time ^^.

Do you have a plan to send a PR on that?

Can do that.

Yeah, I tried Graves attention, with and without bidirectional decoding. I had the same experience as you; the results sounded worse. In my experience, I have had the best results so far with the original attention and GST.

Currently I’m experimenting with the new DDC feature from the dev branch.

Here are the first results of the multi-speaker vocoder at 700k steps. These samples were generated with the commit from the current master branch and the GST modifications I made.

https://soundcloud.com/sanjaesc-395770686/sets/gothic-tts-tacotron-2

Some sound better than others.
The two female voices sound rather bad.

These are the audio parameters I used. Any tips here?

// AUDIO PARAMETERS
"audio":{
    "num_freq": 1025,        // number of stft frequency levels. Size of the linear spectrogram frame.
    "win_length": 1024,      // stft window length in samples.
    "hop_length": 256,       // stft window hop-length in samples.
    "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
    "frame_shift_ms": null,  // stft window hop-length in ms. If null, 'hop_length' is used.

    // Audio processing parameters
    "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different from the original data, it is resampled.
    "preemphasis": 0.98,    // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
    "ref_level_db": 20,     // reference level db, theoretically 20 db is the sound of air.

    // Silence trimming
    "do_trim_silence": true,// enable trimming of silence as you load the audio. LJSpeech (false), TWEB (false), Nancy (true)
    "trim_db": 60,          // threshold for trimming silence. Set this according to your dataset.

    // MelSpectrogram parameters
    "num_mels": 80,         // size of the mel spec frame.
    "mel_fmin": 40.0,       // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
    "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!
    "spec_gain": 20.0,      // scaler value applied after log transform of spectrogram.

    // Normalization parameters
    "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined, otherwise range normalization defined by the other params.
    "min_level_db": -100,   // lower bound for normalization
    "symmetric_norm": true, // move normalization to range [-1, 1]
    "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
    "clip_norm": true,      // clip normalized values into the range.
    "stats_path": null      // DO NOT USE WITH MULTI_SPEAKER MODEL. Scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored.
},

Are num_mels, mel_fmin and mel_fmax the same on both the TTS and vocoder models? The other day I trained a female TTS model with a mel_fmin of 80 while my vocoder had a mel_fmin of 0, and it did not work.
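
A quick way to catch such mismatches is a small check script like the following (my own sketch with hypothetical config paths; the // comments in the configs are stripped before parsing):

import json
import re


def load_commented_json(path):
    # the TTS configs contain // comments, which plain JSON does not allow
    with open(path) as f:
        text = re.sub(r"//.*", "", f.read())
    return json.loads(text)


tts_audio = load_commented_json("tts_config.json")["audio"]      # hypothetical path
voc_audio = load_commented_json("vocoder_config.json")["audio"]  # hypothetical path

for key in ("sample_rate", "num_mels", "mel_fmin", "mel_fmax", "hop_length", "win_length"):
    assert tts_audio[key] == voc_audio[key], (
        f"audio mismatch for {key}: {tts_audio[key]} vs {voc_audio[key]}")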

Are you intending to release this vocoder? It sounds so good! I tried to train PWGan on LibriTTS, but unfortunately the results were very bad with a lot of static.

Yeah, the TTS and vocoder audio settings are the same. Sure, I will upload the TTS model and vocoder once training is finished. But the vocoder is only trained on data from my dataset, so I don’t think it will work well with other models.

This is awesome! I love Gothic 2 and I think it’s one of the best games ever, specifically because of the speech. Thanks for sharing.

Question!

Did anyone try to train a multi-speaker model with the newly integrated vocoder module?
I’ve noticed that spec_gain is set to 1 in the config. If I do that, the model doesn’t seem to learn once the discriminator kicks in at 200k steps… it’s just gibberish from there on. With spec_gain set to 20 it works, so do I train with 20 and set it to 1 during inference?
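
For context, spec_gain is the scaler applied after the log transform of the spectrogram (see the audio config above), so whichever value the model was trained with also has to be used when decoding at inference time. A rough illustration (my own sketch, not the exact AudioProcessor code):

import numpy as np


def amp_to_db(x, spec_gain):
    # log-compress the spectrogram and scale it by spec_gain
    return spec_gain * np.log10(np.maximum(1e-5, x))


def db_to_amp(x, spec_gain):
    # inverse of amp_to_db for the same spec_gain
    return np.power(10.0, x / spec_gain)


mel = np.random.rand(80, 100)
# the round-trip only works if the same gain is used in both directions
print(np.allclose(mel, db_to_amp(amp_to_db(mel, 20.0), 20.0), atol=1e-4))  # True
print(np.allclose(mel, db_to_amp(amp_to_db(mel, 20.0), 1.0), atol=1e-4))   # False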

Currently I am experimenting with the pwgan implementation, it’s at 600k steps. The results are not great so far.

Also, would it make more sense to train on just the data I used during the TTS training? Right now I also included some other speakers from the dataset I am working with, which were not used during TTS training.

How much speech do you have for each speaker? All the PWGan experiments I tried on LibriTTS were very bad: a lot of noise and static. 6 speakers with 10 hours each are giving me much better results and the same level of generalization.

It varies quite a lot, from a few hours per speaker down to a few minutes. The audio quality is good overall.

I will see if training only on the data used during TTS training improves the quality in any way.

Did you try the native vocoder implementation?

Really impressive.

The Google Drive link to the models above does not work anymore. Do you plan to provide the newest models and notebooks to run and test them? Thank you.