Trouble with WaveRNN universal vocoder model

I trained a Tacotron2 model on a 500-speaker LibriTTS subset that sounds reasonable with Griffin-Lim:

I've heard some of the samples from WaveRNN and they really seem to get rid of the borg effect, but it sounds terrible when I try:

(Note: LibriTTS sounds much better with Tacotron 1 and Griffin-Lim in my hands; it's a different sentence, but in case you want to hear: https://drive.google.com/open?id=1qNStllZqMwRPIR0vgNhMfMQg7-DeAdEW)

I am using the universal vocoder model from the WaveRNN GitHub repo, and my TTS model was trained with the same audio params (below). Can anyone tell me what I am doing wrong?

TTS audio config:

    {
        "github_branch": "* dev",
        "model": "Tacotron2",             // one of the models in models/
        "run_name": "l500p_wrnn",
        "run_description": "tacotron with large libritts subset and reagan",

        // AUDIO PARAMETERS
        "audio": {
            // Audio processing parameters
            "num_mels": 80,           // size of the mel spec frame.
            "num_freq": 1025,         // number of stft frequency levels. Size of the linear spectrogram frame.
            "sample_rate": 16000,     // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
            "win_length": 1024,       // stft window length in samples.
            "hop_length": 200,        // stft window hop length in samples.
            "preemphasis": 0.98,      // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
            "frame_length_ms": 50,    // stft window length in ms. If null, 'win_length' is used.
            "frame_shift_ms": 12.5,   // stft window hop length in ms. If null, 'hop_length' is used.
            "min_level_db": -100,     // normalization range
            "ref_level_db": 20,       // reference level db, theoretically 20db is the sound of air.
            "power": 1.5,             // value to sharpen wav signals after GL algorithm.
            "griffin_lim_iters": 60,  // #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
            "signal_norm": true,      // normalize the spec values in range [0, 1]
            "symmetric_norm": false,  // move normalization to range [-1, 1]
            "max_norm": 1.0,          // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
            "clip_norm": true,        // clip normalized values into the range.
            "mel_fmin": 0.0,          // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
            "mel_fmax": 8000.0,       // maximum freq level for mel-spec. Tune for dataset!!
            "do_trim_silence": true,  // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
            "trim_db": 60             // threshold for trimming silence. Set this according to your dataset.
        },
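One thing worth double-checking in the config above: per its own comments, non-null frame_length_ms / frame_shift_ms take precedence over win_length / hop_length, so the effective STFT sizes are the ms values converted to samples, not necessarily the explicit 1024/200. A quick back-of-the-envelope check (plain Python, just the arithmetic):

    # Sanity check: convert the ms-based frame settings to samples.
    # Per the config comments, frame_length_ms/frame_shift_ms win over
    # win_length/hop_length when they are non-null.
    sample_rate = 16000
    win_length = int(sample_rate * 50 / 1000)    # frame_length_ms=50  -> 800 samples
    hop_length = int(sample_rate * 12.5 / 1000)  # frame_shift_ms=12.5 -> 200 samples
    print(win_length, hop_length)                # 800 200
    # Note the explicit "win_length": 1024 in the config differs from the
    # 800 samples implied by frame_length_ms; whichever value is actually
    # used has to be identical on the TTS and vocoder sides.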

config_16K.json from WaveRNN:

    {
        "model_name": "libriTTS-360",
        "model_description": "Training a universal vocoder. finetune 24000 sr model for 16K sr",

        "audio": {
            // Audio processing parameters
            "num_mels": 80,           // size of the mel spec frame.
            "num_freq": 1025,         // number of stft frequency levels. Size of the linear spectrogram frame.
            "sample_rate": 16000,     // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
            "frame_length_ms": 50,    // stft window length in ms.
            "frame_shift_ms": 12.5,   // stft window hop length in ms.
            "preemphasis": 0.98,      // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
            "min_level_db": -100,     // normalization range
            "ref_level_db": 20,       // reference level db, theoretically 20db is the sound of air.
            "power": 1.5,             // value to sharpen wav signals after GL algorithm.
            "griffin_lim_iters": 60,  // #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
            "signal_norm": true,      // normalize the spec values in range [0, 1]
            "symmetric_norm": false,  // move normalization to range [-1, 1]
            "max_norm": 1,            // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
            "clip_norm": true,        // clip normalized values into the range.
            "mel_fmin": 0.0,          // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
            "mel_fmax": 8000.0,       // maximum freq level for mel-spec. Tune for dataset!!
            "do_trim_silence": true   // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        }
    }
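Since "trained with the same audio params" is exactly the assumption that has to hold, here is a rough sketch of how I'd diff the two audio sections programmatically instead of by eye (the paths are placeholders for wherever your configs live; the comment stripping is naive but fine for these files):

    import json
    import re

    def load_config(path):
        # These configs carry // comments, so they are not strict JSON;
        # strip the comments before parsing (naive: would break on string
        # values that contain "//", which these files don't have).
        with open(path) as f:
            return json.loads(re.sub(r"//.*", "", f.read()))

    tts = load_config("config_tts.json")  # placeholder paths
    voc = load_config("config_16K.json")

    for key in sorted(set(tts["audio"]) | set(voc["audio"])):
        a, b = tts["audio"].get(key), voc["audio"].get(key)
        if a != b:
            print(f"MISMATCH {key}: tts={a!r} vocoder={b!r}")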

I tried to step back and just use the universal vocoder model with an LJSpeech-trained TTS model (trained with the vocoder's audio params), but the WaveRNN output still has that weird quality. I've heard some of the published WaveRNN output and it sounds great, and my Griffin-Lim TTS output sounds comparable.
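One more check I can think of is to take the TTS model out of the loop entirely: compute a mel from a ground-truth LibriTTS wav using the vocoder's own audio settings and feed that straight to WaveRNN (copy-synthesis). If that already sounds bad, the feature extraction/normalization is mismatched; if it sounds fine, the problem is on the TTS side. Roughly like this, with the caveat that AudioProcessor is the Mozilla TTS class (the import path depends on your checkout) and the final call is schematic pseudocode, since it depends on the WaveRNN repo's inference script:

    # Copy-synthesis check: ground-truth wav -> mel (vocoder params) -> WaveRNN.
    from TTS.utils.audio import AudioProcessor

    ap = AudioProcessor(**voc["audio"])          # voc = vocoder config, loaded as above
    wav = ap.load_wav("some_libritts_utt.wav")   # placeholder ground-truth utterance
    mel = ap.melspectrogram(wav)                 # features exactly as the vocoder expects

    # audio_out = wavernn_model.generate(mel)    # repo-specific inference call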

Frustrating. I'll go ahead and try training my own WaveRNN model on the 500 LibriTTS speakers… if anyone has any guesses as to what I am doing wrong, I would love to hear them.