I cannot get PWGAN to converge or sound good

Hi,

I have been trying to fine-tune the LJSpeech PWGAN model on my single speaker, but the quality refuses to improve. The voice has indeed changed, but it has plateaued: it sounds deeper, with some noise, and is not getting any better. I have been training for 2 days now, and the loss does not really seem to be decreasing, only spiking around. First I tried a batch size of 4, and then a batch size of 2 combined with distributed training to cut the time.

Does anyone have any ideas?

allow_cache: true
format: "hdf5"
audio:
  clip_norm: true
  do_trim_silence: true
  frame_length_ms: 50
  frame_shift_ms: 12.5
  max_norm: 4
  hop_length: 275
  win_length: 1100
  mel_fmax: 8000.0
  mel_fmin: 0.0
  min_level_db: -100
  num_freq: 1025
  num_mels: 80
  preemphasis: 0.98
  ref_level_db: 20
  sample_rate: 22050
  signal_norm: true
  sound_norm: false
  symmetric_norm: true
  trim_db: 60
augment: false
batch_max_steps: 28875
batch_size: 2
discriminator_grad_norm: 1
discriminator_optimizer_params:
  eps: 1.0e-06
  lr: 5.0e-05
  weight_decay: 0.0
discriminator_params:
  dropout: 0.0
  gate_channels: 64
  in_channels: 1
  kernel_size: 3
  layers: 30
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.2
  out_channels: 1
  residual_channels: 32
  skip_channels: 32
  stacks: 3
  use_weight_norm: true
discriminator_scheduler_params:
  gamma: 0.5
  step_size: 200000
discriminator_train_start_steps: 100000
discriminator_type: ResidualParallelWaveGANDiscriminator
distributed: false
eval_interval_steps: 1000
generator_grad_norm: 10
generator_optimizer_params:
  eps: 1.0e-06
  lr: 0.001
  weight_decay: 0.0
generator_params:
  aux_channels: 80
  aux_context_window: 2
  dropout: 0.0
  gate_channels: 128
  in_channels: 1
  kernel_size: 3
  layers: 30
  out_channels: 1
  residual_channels: 64
  skip_channels: 64
  stacks: 3
  upsample_net: ConvInUpsampleNetwork
  upsample_params:
    upsample_scales:
    - 5
    - 5
    - 11
  use_weight_norm: true
generator_scheduler_params:
  gamma: 0.5
  step_size: 200000
generator_type: ParallelWaveGANGenerator
hop_size: 275
lambda_adv: 4.0
log_interval_steps: 100
num_save_intermediate_results: 4
num_workers: 2
pin_memory: true
rank: 0
remove_short_samples: true
save_interval_steps: 2000
train_max_steps: 1000000
verbose: 1

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
stft_loss_params:
    fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
    window: "hann_window"         # Window function for STFT-based loss

For future reference for anyone struggling: I found that not using v3 of the PWG discriminator does the trick. Using v1 instead, I have been able to get it to sound much better in under one day :slight_smile: This is the config I use:

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
format: "hdf5"
audio:
  clip_norm: true
  do_trim_silence: true
  frame_length_ms: 50
  frame_shift_ms: 12.5
  max_norm: 4
  hop_length: 275
  win_length: 1100
  mel_fmax: 8000.0
  mel_fmin: 0.0
  min_level_db: -100
  num_freq: 1025
  num_mels: 80
  preemphasis: 0.98
  ref_level_db: 20
  sample_rate: 22050
  signal_norm: true
  sound_norm: false
  symmetric_norm: true
  trim_db: 60

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_params:
    in_channels: 1        # Number of input channels.
    out_channels: 1       # Number of output channels.
    kernel_size: 3        # Kernel size of dilated convolution.
    layers: 30            # Number of residual block layers.
    stacks: 3             # Number of stacks i.e., dilation cycles.
    residual_channels: 64 # Number of channels in residual conv.
    gate_channels: 128    # Number of channels in gated conv.
    skip_channels: 64     # Number of channels in skip conv.
    aux_channels: 80      # Number of channels for auxiliary feature conv.
                          # Must be the same as num_mels.
    aux_context_window: 2 # Context window size for auxiliary feature.
                          # If set to 2, previous 2 and future 2 frames will be considered.
    dropout: 0.0          # Dropout rate. 0.0 means no dropout applied.
    use_weight_norm: true # Whether to use weight norm.
                          # If set to true, it will be applied to all of the conv layers.
    upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
    upsample_params:                      # Upsampling network parameters.
      upsample_scales:
      - 5
      - 5
      - 11

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_params:
    in_channels: 1        # Number of input channels.
    out_channels: 1       # Number of output channels.
    kernel_size: 3        # Kernel size of conv layers.
    layers: 10            # Number of conv layers.
    conv_channels: 64     # Number of channels in conv layers.
    bias: true            # Whether to use bias parameter in conv.
    use_weight_norm: true # Whether to use weight norm.
                          # If set to true, it will be applied to all of the conv layers.
    nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
    nonlinear_activation_params:      # Nonlinear function parameters
        negative_slope: 0.2           # Alpha in LeakyReLU.

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
stft_loss_params:
    fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
    window: "hann_window"         # Window function for STFT-based loss

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_adv: 4.0  # Loss balancing coefficient.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 1              # Batch size.
batch_max_steps: 28875     # Length of each audio in batch. Make sure it is divisible by hop_size.
pin_memory: true           # Whether to pin memory in PyTorch DataLoader.
num_workers: 2             # Number of workers in PyTorch DataLoader.
remove_short_samples: true # Whether to remove samples shorter than batch_max_steps.
allow_cache: true          # Whether to allow cache in dataset. If true, it requires cpu memory.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_params:
    lr: 0.0001             # Generator's learning rate.
    eps: 1.0e-6            # Generator's epsilon.
    weight_decay: 0.0      # Generator's weight decay coefficient.
generator_scheduler_params:
    step_size: 200000      # Generator's scheduler step size.
    gamma: 0.5             # Generator's scheduler gamma.
                           # At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10    # Generator's gradient norm.
discriminator_optimizer_params:
    lr: 0.00005            # Discriminator's learning rate.
    eps: 1.0e-6            # Discriminator's epsilon.
    weight_decay: 0.0      # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
    step_size: 200000      # Discriminator's scheduler step size.
    gamma: 0.5             # Discriminator's scheduler gamma.
                           # At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1 # Discriminator's gradient norm.

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
discriminator_train_start_steps: 100000 # Step at which discriminator training starts.
train_max_steps: 400000                 # Number of training steps.
save_interval_steps: 5000               # Interval steps to save checkpoint.
eval_interval_steps: 1000               # Interval steps to evaluate the network.
log_interval_steps: 100                 # Interval steps to record the training log.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.
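
The comments in the config name two constraints: `aux_channels` must equal `num_mels`, and `batch_max_steps` must be divisible by the hop size. A minimal Python sketch for checking them before starting a long run follows. The `check_config` helper and the flat dict are made up for illustration. The third check, that the product of `upsample_scales` equals the hop size, is an assumption based on the generator having to upsample one mel frame to one hop of samples; it is not stated in the config itself.

```python
from math import prod

# Values copied by hand from the config above.
config = {
    "num_mels": 80,
    "hop_length": 275,
    "aux_channels": 80,
    "upsample_scales": [5, 5, 11],
    "batch_max_steps": 28875,
}


def check_config(cfg):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    hop = cfg["hop_length"]
    # Assumption: the upsampling network must expand one frame to `hop` samples.
    if prod(cfg["upsample_scales"]) != hop:
        problems.append("product of upsample_scales != hop size")
    # Stated in the config comments.
    if cfg["batch_max_steps"] % hop != 0:
        problems.append("batch_max_steps is not divisible by hop size")
    # Stated in the config comments.
    if cfg["aux_channels"] != cfg["num_mels"]:
        problems.append("aux_channels != num_mels")
    return problems


problems = check_config(config)
print(problems or "config is consistent")
```

With the values above, 5 * 5 * 11 = 275 and 28875 = 105 * 275, so all three checks pass.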

It is a mix of an original PWGAN config and @erogol’s TTS configs found on the fork.
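
To see what actually changed between the config from the first post and this one, a small diff helps. This is only a sketch: `flatten` and `diff_configs` are hypothetical helpers, and the two dicts contain just a few values copied by hand from the two configs in this thread, not the full files.

```python
def flatten(d, prefix=""):
    """Flatten a nested dict into {"a.b.c": value} form."""
    flat = {}
    for key, value in d.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def diff_configs(old, new):
    """Return {path: (old_value, new_value)} for every setting that differs.

    A setting missing on one side shows up as None on that side.
    """
    a, b = flatten(old), flatten(new)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Excerpt of the config from the first post (v3 discriminator).
first = {
    "batch_size": 2,
    "train_max_steps": 1000000,
    "lambda_adv": 4.0,
    "discriminator_type": "ResidualParallelWaveGANDiscriminator",
    "discriminator_params": {"layers": 30, "stacks": 3},
    "generator_optimizer_params": {"lr": 0.001},
}

# Excerpt of the config above (v1 discriminator, no discriminator_type set).
second = {
    "batch_size": 1,
    "train_max_steps": 400000,
    "lambda_adv": 4.0,
    "discriminator_params": {"layers": 10, "conv_channels": 64},
    "generator_optimizer_params": {"lr": 0.0001},
}

changes = diff_configs(first, second)
for path, (old, new) in changes.items():
    print(f"{path}: {old} -> {new}")
```

Besides the discriminator, the two configs also differ in the generator learning rate (0.001 vs 0.0001) and the batch size, so the discriminator may not be the only change that matters.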

I am training on a private speaker and will let it run to the full 400k steps, then decide whether to continue to 1000k. If this works, my goal is to train a universal LibriTTS vocoder at 22050 Hz and release it here. We will see how it goes :grinning:
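
When weighing 400k against 1000k steps, it may help to see what the scheduler settings do to the learning rate over that range. The sketch below assumes a plain step decay (the learning rate is multiplied by `gamma` every `step_size` steps, as PyTorch's `StepLR` does); `step_decay_lr` is a hypothetical helper, not a function from the training code.

```python
def step_decay_lr(base_lr, step, step_size=200_000, gamma=0.5):
    """Learning rate after `step` updates, assuming step decay."""
    return base_lr * gamma ** (step // step_size)


# Generator (1e-4) and discriminator (5e-5) learning rates from the config above.
for step in (0, 200_000, 400_000, 1_000_000):
    gen_lr = step_decay_lr(1e-4, step)
    disc_lr = step_decay_lr(5e-5, step)
    print(f"step {step:>9}: generator lr {gen_lr:.3e}, discriminator lr {disc_lr:.3e}")
```

Under this assumption, both learning rates have been halved twice by 400k steps and five times by 1000k steps, so the later steps make much smaller updates than the early ones.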


Thanks, that helps a lot. It would be great to have a vocoder to try. erogol’s old 16 kHz WaveRNN one shows what it can do, it just sounds like Mickey :slight_smile: