Train Multispeaker Dataset + WaveRNN

The results sound really good. With a vocoder on top of it, that would be perfect.

I’m trying to train a multispeaker vocoder (with your newest implementation from the dev branch), currently at ~400k steps. It takes time ^^.

Do you plan to send a PR for that?

Can do that.

Yeah, I tried Graves attention, with and without bidirectional decoding. I had the same experience as you: the results sounded worse. In my experience, the best results so far came from the original attention and GST.

Currently I’m experimenting with the new DDC feature from the dev branch.

Here are the first results of the multi-speaker vocoder at 700k steps. These samples were generated with the current master branch and the GST modifications I made.

https://soundcloud.com/sanjaesc-395770686/sets/gothic-tts-tacotron-2

Some sound better than others.
The two female voices sound rather bad.

These are the audio parameters used. Any tips here?

// AUDIO PARAMETERS
"audio":{
    "num_freq": 1025,         // number of stft frequency levels. Size of the linear spectogram frame.
    "win_length": 1024,      // stft window length in ms.
    "hop_length": 256,       // stft window hop-lengh in ms.
    "frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
    "frame_shift_ms": null,  // stft window hop-lengh in ms. If null, 'hop_length' is used.

    // Audio processing parameters
    "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
    "preemphasis": 0.98,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
    "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.

    // Silence trimming
    "do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
    "trim_db": 60,          // threshold for timming silence. Set this according to your dataset.

    // MelSpectrogram parameters
    "num_mels": 80,         // size of the mel spec frame.
    "mel_fmin": 40.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
    "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!
    "spec_gain": 20.0,         // scaler value appplied after log transform of spectrogram.

    // Normalization parameters
    "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
    "min_level_db": -100,   // lower bound for normalization
    "symmetric_norm": true, // move normalization to range [-1, 1]
    "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
    "clip_norm": true,      // clip normalized values into the range.
    "stats_path": null    // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based notmalization is used and other normalization params are ignored
},
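
For reference, this is roughly how I eyeball those values outside the training code. It is just a librosa sketch, not the repo's AudioProcessor; the wav path is a placeholder and n_fft is derived from num_freq (1025 bins -> n_fft // 2 + 1 -> 2048):

# rough sketch: recompute a mel spectrogram with the config values above using
# librosa, to check mel_fmin/mel_fmax against a given speaker's recordings
import librosa
import numpy as np

wav, sr = librosa.load("speaker_sample.wav", sr=22050)      # sample_rate
wav = librosa.effects.preemphasis(wav, coef=0.98)           # preemphasis
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr,
    n_fft=2048,                                             # num_freq = 1025 = n_fft // 2 + 1
    win_length=1024, hop_length=256,
    n_mels=80, fmin=40.0, fmax=8000.0,
)
mel_db = 20.0 * np.log10(np.maximum(1e-5, mel))             # spec_gain as the log-scale scaler
print(mel_db.shape, float(mel_db.min()), float(mel_db.max()))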

Are num_mels, mel_fmin and mel_fmax the same on both the TTS and vocoder models? The other day I trained a female TTS with a mel_fmin of 80 while my vocoder had a mel_fmin of 0, and it did not work.

Are you intending to release this vocoder? It sounds so good! I tried to train PWGan on LibriTTS, but unfortunately the results were very bad with a lot of static.

Yeah, the TTS and vocoder audio settings are the same. Sure, I will upload the TTS model and the vocoder once training is finished. But the vocoder is only trained on data from my dataset, so I don’t think it will work well with other models.
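
If it helps, a quick way to double-check that both configs really match is to diff their "audio" sections. The file names here are placeholders and the // comment stripping is naive, so treat it as a sketch:

# sketch: diff the "audio" section of a TTS config and a vocoder config;
# the configs contain //-style comments, so strip them before parsing as JSON
import json
import re

def load_config(path):
    with open(path, encoding="utf-8") as f:
        return json.loads(re.sub(r"//[^\n]*", "", f.read()))

tts_audio = load_config("tts_config.json")["audio"]
voc_audio = load_config("vocoder_config.json")["audio"]

for key in sorted(set(tts_audio) | set(voc_audio)):
    if tts_audio.get(key) != voc_audio.get(key):
        print(f"MISMATCH {key}: tts={tts_audio.get(key)!r} vocoder={voc_audio.get(key)!r}")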

This is awesome! I love Gothic 2 and I think it’s one of the best games ever, especially because of the speech. Thanks for sharing.

Question!

Did anyone try to train a multi-speaker model with the newly integrated vocoder module?
I’ve noticed that spec_gain is set to 1 in the config. If I use that, the model doesn’t seem to learn once the discriminator kicks in at 200k steps… it’s just gibberish from there on. With spec_gain set to 20 it works, so do I train with 20 and set it to 1 during inference?
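
My rough understanding, going by the config comment above ("scaler value applied after log transform of spectrogram"), is that spec_gain scales the log spectrogram, so the value used for the TTS features, the vocoder training features and the inference features has to be the same. A toy sketch (not the repo's exact code) of why mixing values breaks things:

# toy sketch: spec_gain as the scaler applied after the log transform
import numpy as np

def amp_to_db(x, spec_gain):
    return spec_gain * np.log10(np.maximum(1e-5, x))

def db_to_amp(x, spec_gain):
    return np.power(10.0, x / spec_gain)

mel = np.random.rand(80, 100)
roundtrip_ok = db_to_amp(amp_to_db(mel, 20.0), 20.0)   # recovers the input
roundtrip_bad = db_to_amp(amp_to_db(mel, 20.0), 1.0)   # encoded with 20, decoded with 1
print(np.abs(mel - roundtrip_ok).max(), np.abs(mel - roundtrip_bad).max())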

Currently I am experimenting with the PWGan implementation; it’s at 600k steps. The results are not great so far.

Also, would it make more sense to train on just the data I used during TTS training? Right now I also included some other speakers from the dataset I am working with that were not used during TTS training.
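
In case I try that, this is roughly how I'd filter the file list down to the TTS speakers; the speaker names, file names and the "wav_file|text|speaker_name" layout are just placeholders for my setup, not anything from the repo:

# sketch: keep only utterances from the speakers used for TTS training
tts_speakers = {"speaker_01", "speaker_02", "speaker_03"}

with open("metadata_all.csv", encoding="utf-8") as f:
    lines = [ln for ln in f if ln.strip()]

kept = [ln for ln in lines if ln.strip().split("|")[-1] in tts_speakers]

with open("metadata_vocoder.csv", "w", encoding="utf-8") as f:
    f.writelines(kept)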

How much speech do you have for each speaker? All the PWGan experiments I tried on LibriTTS were very bad: a lot of noise and static. Six speakers with 10 hours each are giving me much better results and the same level of generalization.

It varies a lot, from a few hours for some speakers down to just a few minutes for others. Overall the audio quality is good.

I will see if training only on the data used during TTS training improves the quality in any way.

Did you try the native vocoder implementation?

Really impressive.

The Google Drive link to the models above does not work anymore. Do you plan to provide the newest models and notebooks to run and test them? Thank you.

Hey. Just wanted to confirm something for LibriTTS. Since the speaker IDs range up to 10000 but there are only 2456 speakers, do I have to remap each original speaker_id to the 1-2456 range for the speaker embedding vocab? I think this should be done, but I’m still a newbie, so I’m asking.
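
To make the question concrete, this is the kind of remapping I mean; the item layout below is just an example, not the repo's API, and whether the indices start at 0 or 1 depends on how the embedding table is sized:

# sketch of the remapping idea: map each original LibriTTS speaker id
# to a contiguous index for the speaker embedding table
import json

items = [
    ("19/198/19_198_000001.wav", "some text", "19"),
    ("10001/xyz/10001_xyz_000002.wav", "other text", "10001"),
    ("19/198/19_198_000003.wav", "more text", "19"),
]

speaker_ids = sorted({spk for _, _, spk in items})               # 2456 unique ids in full LibriTTS
speaker_to_idx = {spk: i for i, spk in enumerate(speaker_ids)}   # contiguous 0..N-1

with open("speakers.json", "w", encoding="utf-8") as f:
    json.dump(speaker_to_idx, f, indent=2)

print(speaker_to_idx)   # {'10001': 0, '19': 1}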