Training universal PWGAN, background noise and bad output

georroussos · July 23, 2020, 9:40am

I have been trying to train a universal ParallelWaveGAN for a long while now but I have not really had any success. I first started with a private single speaker and it worked very well with no background noise, so I thought I would try my luck with LibriTTS. My understanding was that the network would be exposed to many, many more waveforms, so it would be able to produce speech for any speaker. The models I have been training have all been outputting sound with artifacts and background noise, like a static “zzzz”. I thought I could introduce some more private speaker to LibriTTS, so I did (sum was Libri-360 and 200 more speakers). The results did not really change. At best, they sound muffled. I would never expect WaveRNN quality, but I would at least expect single speaker quality. I also do not know what to make of my tensorboard logs, but I think they do look pretty erratic.

    format: "hdf5"
  audio:
    clip_norm: true
    do_trim_silence: false
    frame_length_ms: 50
    frame_shift_ms: 12.5
    max_norm: 4
    hop_length: 275
    win_length: 1100
    mel_fmax: 8000.0
    mel_fmin: 0.0
    min_level_db: -100
    num_freq: 1025
    num_mels: 80
    preemphasis: 0.98
    ref_level_db: 20
    sample_rate: 22050
    signal_norm: true
    sound_norm: false
    symmetric_norm: true
    trim_db: 20

  generator_params:
      in_channels: 1        # Number of input channels.
      out_channels: 1       # Number of output channels.
      kernel_size: 3        # Kernel size of dilated convolution.
      layers: 30            # Number of residual block layers.
      stacks: 3             # Number of stacks i.e., dilation cycles.
      residual_channels: 64 # Number of channels in residual conv.
      gate_channels: 128    # Number of channels in gated conv.
      skip_channels: 64     # Number of channels in skip conv.
      aux_channels: 80      # Number of channels for auxiliary feature conv.
                            # Must be the same as num_mels.
      aux_context_window: 2 # Context window size for auxiliary feature.
                            # If set to 2, previous 2 and future 2 frames will be considered.
      dropout: 0.0          # Dropout rate. 0.0 means no dropout applied.
      use_weight_norm: true # Whether to use weight norm.
                            # If set to true, it will be applied to all of the conv layers.
      upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
      upsample_params:                      # Upsampling network parameters.
        upsample_scales:
        - 5
        - 5
        - 11


  discriminator_params:
      in_channels: 1        # Number of input channels.
      out_channels: 1       # Number of output channels.
      kernel_size: 3        # Number of output channels.
      layers: 10            # Number of conv layers.
      conv_channels: 64     # Number of chnn layers.
      bias: true            # Whether to use bias parameter in conv.
      use_weight_norm: true # Whether to use weight norm.
                            # If set to true, it will be applied to all of the conv layers.
      nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
      nonlinear_activation_params:      # Nonlinear function parameters
          negative_slope: 0.2           # Alpha in LeakyReLU.


  stft_loss_params:
      fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
      hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
      win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
      window: "hann_window"         # Window function for STFT-based loss


  lambda_adv: 4.0  # Loss balancing coefficient.


  batch_size: 8              # Batch size.
  batch_max_steps: 26125     # Length of each audio in batch. Make sure dividable by hop_size.
  pin_memory: true           # Whether to pin memory in Pytorch DataLoader.
  num_workers: 8             # Number of workers in Pytorch DataLoader.
  remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
  allow_cache: false          # Whether to allow cache in dataset. If true, it requires cpu memory.


  generator_optimizer_params:
      lr: 0.0001             # Generator's learning rate.
      eps: 1.0e-6            # Generator's epsilon.
      weight_decay: 0.0      # Generator's weight decay coefficient.
  generator_scheduler_params:
      step_size: 200000      # Generator's scheduler step size.
      gamma: 0.5             # Generator's scheduler gamma.
                             # At each step size, lr will be multiplied by this parameter.
  generator_grad_norm: 10    # Generator's gradient norm.
  discriminator_optimizer_params:
      lr: 0.00005            # Discriminator's learning rate.
      eps: 1.0e-6            # Discriminator's epsilon.
      weight_decay: 0.0      # Discriminator's weight decay coefficient.
  discriminator_scheduler_params:
      step_size: 200000      # Discriminator's scheduler step size.
      gamma: 0.5             # Discriminator's scheduler gamma.
                             # At each step size, lr will be multiplied by this parameter.
  discriminator_grad_norm: 1 # Discriminator's gradient norm.


  discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
  train_max_steps: 1000000                 # Number of training steps.
  save_interval_steps: 5000               # Interval steps to save checkpoint.
  eval_interval_steps: 1000               # Interval steps to evaluate the network.
  log_interval_steps: 100                 # Interval steps to record the training log.


  num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.

I have downsampled the dataset and changed the hop and win sizes to accomodate with my TTS attributes.

erogol · July 23, 2020, 1:00pm

did you change the libriTTS sampling rate since the default is 24khz I think which is not compatible with your config file?

georroussos · July 23, 2020, 1:04pm

I did! I downsampled LibriTTS to 22050 to match my TTS.

erogol · July 23, 2020, 1:05pm

have you listened the samples after down sampling ?

erogol · July 23, 2020, 1:05pm

Ohh you tried the other PWGAN repo. Maybe you can try the one under TTS dev branch.

georroussos · July 23, 2020, 1:07pm

I did that also, they sounded fine. Yes I trained using the initial PWGAN fork. Is the one under TTS dev different? I can try that one instead. Does it have anything different?

erogol · July 23, 2020, 1:08pm

the model is the same but it’d be more convenient.

georroussos · July 23, 2020, 1:11pm

If it is the same I guess I would have the same performance, which is really strange because I expected better. I don’t know why the same attributes give good results in single speaker but not multispeaker. Have you trained any models on LibriTTS?

erogol · July 23, 2020, 1:13pm

not yet. Maybe the normalization is the culprit.

georroussos · July 23, 2020, 1:15pm

Aha interesting. Looking at my config, though, it is turnt off. Are you saying it is turnt on? I am also at a loss when it comes to the figures for discriminator loss, fake loss and real loss in training, they really look erratic.

erogol · July 24, 2020, 12:30pm

I don’t know if it is turned on or off but it is better to check if it makes sense. Also you can try just a small number of speakers and see how number of speakers affect the problem in general.

georroussos · July 24, 2020, 12:36pm

I think I will try to train another model in the scheme reported here https://papers.nips.cc/paper/9629-melgan-generative-adversarial-networks-for-conditional-waveform-synthesis.pdf. They report 6 speakers with 10 hours each is enough to generalize.

julian.weber · August 5, 2020, 2:25pm

Hi,
I saw under the " Train Multispeaker Dataset + WaveRNN" post that you made it work. Could you please share the weights and some samples ?
It would be great to have a few universal vocoders since they are so expensive to train.

georroussos · August 5, 2020, 4:06pm

Hi, unfortunately it didn’t really work. The vocoders I trained were all bad. A lot of static noise and the voices sound a bit coarse.