I trained Tacotron2 on a 500-speaker LibriTTS subset, and it sounds reasonable with Griffin-Lim:
I heard some of the WaveRNN samples and they really seemed to get rid of the borg effect, but the output sounds terrible when I try it:
(Note: LibriTTS sounds much better with Tacotron 1 and Griffin-Lim in my hands. It's a different sentence, but in case you want to hear it: https://drive.google.com/open?id=1qNStllZqMwRPIR0vgNhMfMQg7-DeAdEW)
I am using the universal vocoder model from the WaveRNN GitHub repo and trained with the same audio parameters (below). Can anyone tell what I am doing wrong?
TTS audio config:
{
    "github_branch": "* dev",
    "model": "Tacotron2",          // one of the models in models/
    "run_name": "l500p_wrnn",
    "run_description": "tacotron with large libritts subset and reagan",

    // AUDIO PARAMETERS
    "audio": {
        // Audio processing parameters
        "num_mels": 80,            // size of the mel spec frame.
        "num_freq": 1025,          // number of stft frequency levels. Size of the linear spectrogram frame.
        "sample_rate": 16000,      // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "win_length": 1024,        // stft window length in samples.
        "hop_length": 200,         // stft window hop-length in samples.
        "preemphasis": 0.98,       // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "frame_length_ms": 50,     // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": 12.5,    // stft window hop-length in ms. If null, 'hop_length' is used.
        "min_level_db": -100,      // normalization range.
        "ref_level_db": 20,        // reference level db, theoretically 20 dB is the sound of air.
        "power": 1.5,              // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,   // #griffin-lim iterations. 30-60 is a good range. The larger the value, the slower the generation.
        "signal_norm": true,       // normalize the spec values in range [0, 1].
        "symmetric_norm": false,   // move normalization to range [-1, 1].
        "max_norm": 1.0,           // scale normalization to range [-max_norm, max_norm] or [0, max_norm].
        "clip_norm": true,         // clip normalized values into the range.
        "mel_fmin": 0.0,           // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,        // maximum freq level for mel-spec. Tune for dataset!!
        "do_trim_silence": true,   // enable trimming of silence of audio as you load it. LJSpeech (false), TWEB (false), Nancy (true).
        "trim_db": 60              // threshold for trimming silence. Set this according to your dataset.
    },
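One thing I tried to sanity-check: win_length/hop_length are in samples while frame_length_ms/frame_shift_ms are in ms, and per the comments the ms values take precedence unless null. A quick back-of-the-envelope in Python (values copied from my config; I'm not sure which pair my branch actually uses):

# Back-of-the-envelope check that the ms-based and sample-based STFT
# parameters describe the same analysis window (values from the config above).
sample_rate = 16000

win_length = 1024        # samples
hop_length = 200         # samples
frame_length_ms = 50     # ms; per the comment, used unless null
frame_shift_ms = 12.5    # ms; per the comment, used unless null

frame_length = int(sample_rate * frame_length_ms / 1000)  # 800 samples
frame_shift = int(sample_rate * frame_shift_ms / 1000)    # 200 samples

print(f"window: {win_length} samples vs {frame_length} samples from ms")  # 1024 vs 800
print(f"hop:    {hop_length} samples vs {frame_shift} samples from ms")   # 200 vs 200

At 16 kHz, 50 ms is 800 samples, so the two window specs disagree (1024 vs 800) while the hops agree (200 vs 200). I don't know if this matters for the vocoder mismatch, but flagging it in case it does.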
config_16K.json from WaveRNN:
{
    "model_name": "libriTTS-360",
    "model_description": "Training a universal vocoder. Finetune 24000 sr model for 16K sr",

    "audio": {
        // Audio processing parameters
        "num_mels": 80,            // size of the mel spec frame.
        "num_freq": 1025,          // number of stft frequency levels. Size of the linear spectrogram frame.
        "sample_rate": 16000,      // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "frame_length_ms": 50,     // stft window length in ms.
        "frame_shift_ms": 12.5,    // stft window hop-length in ms.
        "preemphasis": 0.98,       // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "min_level_db": -100,      // normalization range.
        "ref_level_db": 20,        // reference level db, theoretically 20 dB is the sound of air.
        "power": 1.5,              // value to sharpen wav signals after GL algorithm.
        "griffin_lim_iters": 60,   // #griffin-lim iterations. 30-60 is a good range. The larger the value, the slower the generation.
        "signal_norm": true,       // normalize the spec values in range [0, 1].
        "symmetric_norm": false,   // move normalization to range [-1, 1].
        "max_norm": 1,             // scale normalization to range [-max_norm, max_norm] or [0, max_norm].
        "clip_norm": true,         // clip normalized values into the range.
        "mel_fmin": 0.0,           // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,        // maximum freq level for mel-spec. Tune for dataset!!
        "do_trim_silence": true    // enable trimming of silence of audio as you load it. LJSpeech (false), TWEB (false), Nancy (true).
    }
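In case anyone wants to compare the two audio sections mechanically rather than by eye, something like this sketch works (the file paths are placeholders for wherever the configs are saved, and the // comment stripping is naive):

# Sketch: diff the "audio" sections of the two configs above.
import json
import re

def load_config(path):
    """Load a Mozilla-TTS-style config, stripping // comments first.
    Naive: would also strip a literal // inside a string value."""
    with open(path) as f:
        text = re.sub(r"//[^\n]*", "", f.read())
    return json.loads(text)

tts_audio = load_config("tts_config.json")["audio"]   # placeholder path
voc_audio = load_config("config_16K.json")["audio"]   # placeholder path

for key in sorted(set(tts_audio) | set(voc_audio)):
    tts_val = tts_audio.get(key)
    voc_val = voc_audio.get(key)
    if tts_val != voc_val:
        print(f"{key}: TTS={tts_val!r} vs WaveRNN={voc_val!r}")

On the two sections above, this only flags win_length, hop_length, and trim_db, which exist on the TTS side but not in the WaveRNN config; every key the two share has the same value.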