I have been trying to train a universal ParallelWaveGAN for a while now without success. I started with a private single-speaker dataset and it worked very well, with no background noise, so I thought I would try my luck with LibriTTS. My understanding was that the network would be exposed to many, many more waveforms and would therefore be able to produce speech for any speaker. However, every model I have trained outputs audio with artifacts and background noise, like a static "zzzz". I then tried adding more private speakers to LibriTTS (in total, Libri-360 plus 200 more speakers), but the results did not really change; at best, they sound muffled. I would never expect WaveRNN quality, but I would at least expect single-speaker quality. I also do not know what to make of my TensorBoard logs, but they look pretty erratic. This is my config:
```yaml
format: "hdf5"

audio:
    clip_norm: true
    do_trim_silence: false
    frame_length_ms: 50
    frame_shift_ms: 12.5
    max_norm: 4
    hop_length: 275
    win_length: 1100
    mel_fmax: 8000.0
    mel_fmin: 0.0
    min_level_db: -100
    num_freq: 1025
    num_mels: 80
    preemphasis: 0.98
    ref_level_db: 20
    sample_rate: 22050
    signal_norm: true
    sound_norm: false
    symmetric_norm: true
    trim_db: 20

generator_params:
    in_channels: 1           # Number of input channels.
    out_channels: 1          # Number of output channels.
    kernel_size: 3           # Kernel size of dilated convolution.
    layers: 30               # Number of residual block layers.
    stacks: 3                # Number of stacks, i.e., dilation cycles.
    residual_channels: 64    # Number of channels in residual conv.
    gate_channels: 128       # Number of channels in gated conv.
    skip_channels: 64        # Number of channels in skip conv.
    aux_channels: 80         # Number of channels for auxiliary feature conv.
                             # Must be the same as num_mels.
    aux_context_window: 2    # Context window size for auxiliary feature.
                             # If set to 2, previous 2 and future 2 frames will be considered.
    dropout: 0.0             # Dropout rate. 0.0 means no dropout applied.
    use_weight_norm: true    # Whether to use weight norm.
                             # If set to true, it will be applied to all of the conv layers.
    upsample_net: "ConvInUpsampleNetwork"  # Upsampling network architecture.
    upsample_params:         # Upsampling network parameters.
        upsample_scales:     # Product must equal hop_length (5 * 5 * 11 = 275).
            - 5
            - 5
            - 11

discriminator_params:
    in_channels: 1           # Number of input channels.
    out_channels: 1          # Number of output channels.
    kernel_size: 3           # Kernel size of conv layers.
    layers: 10               # Number of conv layers.
    conv_channels: 64        # Number of conv channels.
    bias: true               # Whether to use bias parameter in conv.
    use_weight_norm: true    # Whether to use weight norm.
                             # If set to true, it will be applied to all of the conv layers.
    nonlinear_activation: "LeakyReLU"  # Nonlinear function after each conv.
    nonlinear_activation_params:       # Nonlinear function parameters.
        negative_slope: 0.2  # Alpha in LeakyReLU.

stft_loss_params:
    fft_sizes: [1024, 2048, 512]   # List of FFT sizes for STFT-based loss.
    hop_sizes: [120, 240, 50]      # List of hop sizes for STFT-based loss.
    win_lengths: [600, 1200, 240]  # List of window lengths for STFT-based loss.
    window: "hann_window"          # Window function for STFT-based loss.

lambda_adv: 4.0              # Loss balancing coefficient.

batch_size: 8                # Batch size.
batch_max_steps: 26125       # Length of each audio clip in a batch. Must be divisible by hop_length.
pin_memory: true             # Whether to pin memory in the PyTorch DataLoader.
num_workers: 8               # Number of workers in the PyTorch DataLoader.
remove_short_samples: true   # Whether to remove samples shorter than batch_max_steps.
allow_cache: false           # Whether to cache the dataset. If true, it requires CPU memory.

generator_optimizer_params:
    lr: 0.0001               # Generator's learning rate.
    eps: 1.0e-6              # Generator's epsilon.
    weight_decay: 0.0        # Generator's weight decay coefficient.
generator_scheduler_params:
    step_size: 200000        # Generator's scheduler step size.
    gamma: 0.5               # Generator's scheduler gamma.
                             # At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10      # Generator's gradient norm clip value.
discriminator_optimizer_params:
    lr: 0.00005              # Discriminator's learning rate.
    eps: 1.0e-6              # Discriminator's epsilon.
    weight_decay: 0.0        # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
    step_size: 200000        # Discriminator's scheduler step size.
    gamma: 0.5               # Discriminator's scheduler gamma.
                             # At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1   # Discriminator's gradient norm clip value.

discriminator_train_start_steps: 100000  # Step at which discriminator training starts.
train_max_steps: 1000000     # Total number of training steps.
save_interval_steps: 5000    # Interval steps to save checkpoints.
eval_interval_steps: 1000    # Interval steps to evaluate the network.
log_interval_steps: 100      # Interval steps to record the training log.
num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.
```
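For reference, the multi-resolution STFT loss that `stft_loss_params` configures (spectral convergence plus log-magnitude L1, averaged over the three resolutions, as described in the Parallel WaveGAN paper) can be sketched in NumPy like this. This is only an illustration of what those three `fft_sizes`/`hop_sizes`/`win_lengths` mean, not the repo's actual PyTorch implementation:

```python
import numpy as np

def stft_magnitude(x, fft_size, hop_size, win_length):
    """Magnitude spectrogram with a Hann window (matches window: "hann_window")."""
    window = np.hanning(win_length)
    frames = [
        np.abs(np.fft.rfft(x[s:s + win_length] * window, n=fft_size))
        for s in range(0, len(x) - win_length + 1, hop_size)
    ]
    return np.array(frames)

def stft_loss(y, y_hat, fft_size, hop_size, win_length, eps=1e-7):
    S = stft_magnitude(y, fft_size, hop_size, win_length)
    S_hat = stft_magnitude(y_hat, fft_size, hop_size, win_length)
    # Spectral convergence: relative Frobenius-norm error of the magnitudes.
    sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + eps)
    # Log-magnitude L1 distance.
    log_mag = np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)))
    return sc + log_mag

def multi_resolution_stft_loss(y, y_hat):
    # The three resolutions from stft_loss_params above.
    resolutions = zip([1024, 2048, 512], [120, 240, 50], [600, 1200, 240])
    losses = [stft_loss(y, y_hat, f, h, w) for f, h, w in resolutions]
    return sum(losses) / len(losses)
```

The mixed window lengths (240 to 1200 samples) are what lets the loss constrain both fine temporal detail and overall spectral envelope, so they should stay sensible relative to your `sample_rate`.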
I have downsampled the dataset and changed the hop and window sizes to match my TTS attributes.
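Since mismatched feature-extraction settings between the TTS and the vocoder are a common cause of muffled or noisy output, a quick sanity check over the values copied from the config above may be worth running (a sketch, not part of the training code):

```python
# Consistency checks between the audio settings and the generator config.
from math import prod

hop_length = 275
win_length = 1100
sample_rate = 22050
num_mels = 80
aux_channels = 80
upsample_scales = [5, 5, 11]
batch_max_steps = 26125

# The generator upsamples mel frames back to audio samples, so the
# product of the upsample scales must equal the hop length.
assert prod(upsample_scales) == hop_length   # 5 * 5 * 11 = 275

# Each training segment must contain a whole number of frames.
assert batch_max_steps % hop_length == 0     # 26125 / 275 = 95 frames

# The auxiliary input channels must match the mel dimension.
assert aux_channels == num_mels

# Express hop/win in milliseconds to compare against the TTS side.
print(f"hop = {1000 * hop_length / sample_rate:.2f} ms, "
      f"win = {1000 * win_length / sample_rate:.2f} ms")
```

Note that at 22050 Hz a `hop_length` of 275 samples is about 12.47 ms, not exactly the 12.5 ms listed as `frame_shift_ms`; if the TTS derives its hop from the millisecond values, that rounding difference is worth double-checking.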