I have been trying to fine-tune the LJSpeech PWGAN model on my single speaker, but it refuses to sound better. The voice has indeed changed, but the quality has plateaued: it sounds deeper, with some noise, and isn't improving. I have been training for 2 days now; the loss doesn't really seem to be improving, only spiking around. First I tried a batch size of 4, and then a batch size of 2 combined with distributed training, to cut the time.
For future reference for anyone struggling: I found that not using v3 of the PWG discriminator does the trick. Using v1, I have been able to get it to sound much better in under one day. This is the config I use:
format: "hdf5"
clip_norm: true
do_trim_silence: true
frame_length_ms: 50
frame_shift_ms: 12.5
max_norm: 4
hop_length: 275
win_length: 1100
mel_fmax: 8000.0
mel_fmin: 0.0
min_level_db: -100
num_freq: 1025
num_mels: 80
preemphasis: 0.98
ref_level_db: 20
sample_rate: 22050
signal_norm: true
sound_norm: false
symmetric_norm: true
trim_db: 60
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of dilated convolution.
layers: 30 # Number of residual block layers.
stacks: 3 # Number of stacks i.e., dilation cycles.
residual_channels: 64 # Number of channels in residual conv.
gate_channels: 128 # Number of channels in gated conv.
skip_channels: 64 # Number of channels in skip conv.
aux_channels: 80 # Number of channels for auxiliary feature conv.
# Must be the same as num_mels.
aux_context_window: 2 # Context window size for auxiliary feature.
# If set to 2, previous 2 and future 2 frames will be considered.
dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
upsample_params: # Upsampling network parameters.
- 5
- 5
- 11
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_size: 3 # Kernel size of conv layers.
layers: 10 # Number of conv layers.
conv_channels: 64 # Number of channels in conv layers.
bias: true # Whether to use bias parameter in conv.
use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in LeakyReLU.
fft_sizes: [1024, 2048, 512] # List of FFT size for STFT-based loss.
hop_sizes: [120, 240, 50] # List of hop size for STFT-based loss
win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
window: "hann_window" # Window function for STFT-based loss
lambda_adv: 4.0 # Loss balancing coefficient.
batch_size: 1 # Batch size.
batch_max_steps: 28875 # Length of each audio in batch. Make sure it is divisible by hop_size.
pin_memory: true # Whether to pin memory in Pytorch DataLoader.
num_workers: 2 # Number of workers in Pytorch DataLoader.
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
lr: 0.0001 # Generator's learning rate.
eps: 1.0e-6 # Generator's epsilon.
weight_decay: 0.0 # Generator's weight decay coefficient.
step_size: 200000 # Generator's scheduler step size.
gamma: 0.5 # Generator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10 # Generator's gradient norm.
lr: 0.00005 # Discriminator's learning rate.
eps: 1.0e-6 # Discriminator's epsilon.
weight_decay: 0.0 # Discriminator's weight decay coefficient.
step_size: 200000 # Discriminator's scheduler step size.
gamma: 0.5 # Discriminator's scheduler gamma.
# At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1 # Discriminator's gradient norm.
discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
train_max_steps: 400000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
log_interval_steps: 100 # Interval steps to record the training log.
num_save_intermediate_results: 4 # Number of results to be saved as intermediate results.
It is a mix of an original config and @erogol’s tts configs found on the fork.
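Since the config mixes values from two sources, a quick consistency check can catch mismatches before a long run. This is only a minimal sketch: the `check_config` helper is hypothetical, and it assumes the three values listed under `upsample_params` are the upsampling scales, whose product has to equal `hop_length`. The other two checks come straight from the comments in the config (`aux_channels` must match `num_mels`, and `batch_max_steps` must be divisible by the hop size).

```python
from math import prod  # Python 3.8+


def check_config(hop_length, upsample_scales, batch_max_steps, num_mels, aux_channels):
    """Return a list of human-readable problems; an empty list means consistent.

    Hypothetical helper for illustration, not part of any vocoder repo.
    """
    problems = []
    # Assumption: the generator upsamples each mel frame by the product of the scales.
    if prod(upsample_scales) != hop_length:
        problems.append(
            f"product of upsample scales {prod(upsample_scales)} != hop_length {hop_length}"
        )
    # From the config comment: batch_max_steps must be divisible by the hop size.
    if batch_max_steps % hop_length != 0:
        problems.append(
            f"batch_max_steps {batch_max_steps} is not divisible by hop_length {hop_length}"
        )
    # From the config comment: aux_channels must be the same as num_mels.
    if aux_channels != num_mels:
        problems.append(f"aux_channels {aux_channels} != num_mels {num_mels}")
    return problems


# Values copied from the config above.
problems = check_config(
    hop_length=275,
    upsample_scales=[5, 5, 11],
    batch_max_steps=28875,
    num_mels=80,
    aux_channels=80,
)
frames_per_segment = 28875 // 275  # mel frames per training segment

print(problems or "config looks consistent")
print("mel frames per training segment:", frames_per_segment)
```

With the values from this config the list comes back empty, and each training segment covers 105 mel frames.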
I am training on a private speaker and will let it run until it reaches the full 400k steps, and then decide whether to go on to 1000k. My goal, if this works, is to train a LibriTTS universal vocoder at 22050 Hz and release it here. We will see how it goes.
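For anyone weighing 400k against 1000k steps, it helps to see how small the learning rates get under the `step_size: 200000` and `gamma: 0.5` settings in the config. The sketch below assumes a StepLR-style rule (the rate is multiplied by `gamma` once every `step_size` steps) and that the scheduler is stepped once per training step. The `step_lr` function is a hypothetical helper, and whether the step counter restarts when fine-tuning from a checkpoint is not covered here.

```python
def step_lr(base_lr, step, step_size=200_000, gamma=0.5):
    """Learning rate after `step` updates under a StepLR-style decay.

    Assumption: lr = base_lr * gamma ** (step // step_size).
    """
    return base_lr * gamma ** (step // step_size)


GENERATOR_LR = 1e-4      # lr: 0.0001 in the config
DISCRIMINATOR_LR = 5e-5  # lr: 0.00005 in the config

for step in (0, 200_000, 400_000, 1_000_000):
    g = step_lr(GENERATOR_LR, step)
    d = step_lr(DISCRIMINATOR_LR, step)
    print(f"step {step:>9,}: generator lr {g:.3e}, discriminator lr {d:.3e}")
```

Under these assumptions the generator rate has been halved twice by 400k steps and five times by 1000k, so the later steps make much smaller updates than the early ones.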
@georroussos, can you please point me to the specific commit in the mozilla/TTS repo for v1 of the PWG discriminator?
I am not able to locate what you referred to here.
Hi, I got it from the original repo, but it is a hassle to edit it to fit the mozilla fork, so I would instead train using the integrated vocoder module.