Training a universal vocoder

Hello,

Since training a vocoder takes time and compute, I’d like to train and contribute a universal vocoder that works for most use cases.
I have compute, but I’m no expert on TTS and I’d like help choosing hyperparameters and tuning the config file.

I know this is possible since @erogol did it with WaveRNN and it worked very well.
I’d like to do the same, but with faster inference speed to cover more use cases, by using either MelGAN or PWGAN on the same LibriTTS dataset.

According to my understanding, the sample rate of the dataset used to train Tacotron doesn’t really matter, because it shouldn’t affect the mel spectrogram (I’m not so sure about that). The only parameters that should affect it are:

  • num_mels (80)
  • mel_fmin (50)
  • mel_fmax (8000, to cover as wide a range of speakers as possible)
  • spec_gain (20, but I suspect a gain mismatch between the TTS and the vocoder could be fixed without retraining)
  • fft_size (1024)
  • win_length (1024)
  • hop_length (256)

And so, still according to my understanding, these are the parameters that must be shared with all models that use the same vocoder.
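
To make that concrete, the shared part of the config might look something like this (written as a Python dict just for illustration; the 24000 Hz sample rate is only LibriTTS’s native rate, not a final choice):

```python
# Hypothetical shared audio settings, as a Python dict for illustration only.
# Values are the ones proposed above; "sample_rate" is assumed to be LibriTTS's
# native 24000 Hz and would change if the dataset were downsampled.
shared_audio_config = {
    "sample_rate": 24000,  # assumption, not part of the list above
    "num_mels": 80,
    "mel_fmin": 50,
    "mel_fmax": 8000,
    "spec_gain": 20,
    "fft_size": 1024,
    "win_length": 1024,
    "hop_length": 256,
}
```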

The vocoder’s output sample rate shouldn’t matter too much, but generating 16 kHz instead of LibriTTS’s native 24 kHz means producing a third fewer samples per second of audio, so it should give roughly a 33% boost in inference performance. (I’m not so sure about that either, since PWGAN and MelGAN are far more parallelised than WaveRNN.)

What do you think about that? Is it a good idea? Am I off in my understanding of the TTS process?


Good idea, though it might be better to downsample to 22050 Hz, since most TTS training I have seen happens at that sample rate. I also need to point out that I tried to train a universal vocoder on LibriTTS many times, but unfortunately it never worked well.

Could you tell me which models you tried and their rough configs, please?

I only tried ParallelWaveGAN. The sample rate was 22050, win_length was 1100 and hop_length was 275, with 80 mels and a mel_fmin of 0. What I got was a lot of static and muffled voices. A paper I read mentions that it is much better to train on a much smaller number of speakers, but with the same amount of speech for each of them; it used 6 speakers (3 male and 3 female) with 10 hours each.
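
In config terms, that was roughly the following (anything not listed was left at its default):

```python
# Audio settings of the universal PWGAN attempt described above (my reading;
# anything not mentioned is assumed to keep its default value).
pwgan_attempt_audio = {
    "sample_rate": 22050,
    "win_length": 1100,
    "hop_length": 275,
    "num_mels": 80,
    "mel_fmin": 0,
}
```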

Hmm, OK, thanks for the insights.
So I may go with M-AILABS then, with one female and one male speaker each for French/German/English…
Did your config work fine on one speaker? Do you think MelGAN would behave better with a high number of speakers? (I don’t think MelGAN’s MOS is good enough to justify the speed/quality trade-off.)

I am able to train PWGAN on LibriTTS to something better than static noise (still training). However, my initial work indicated that we need a larger model, since the original PWGAN model is very small and the variety in the speech requires a stronger model.

I’m also trying to make the model more sampling-rate agnostic by providing a different upsampling network for each target sampling rate, and by feeding batches at different sampling rates during training.

You can check the code here: https://github.com/erogol/TTS_experiments/tree/generic_vocoder
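
A very rough sketch of the idea (not the actual code from that repo; module names, hop lengths and the stand-in generator body are made up):

```python
import torch.nn as nn

class MultiRateVocoder(nn.Module):
    """Sketch: one upsampling network per target sampling rate on top of a
    shared generator body. The hop lengths below are illustrative only
    (they roughly keep the frame duration constant across sampling rates)."""

    def __init__(self, hop_lengths=None, mel_channels=80):
        super().__init__()
        hop_lengths = hop_lengths or {16000: 200, 22050: 275, 24000: 300}
        # one transposed conv per sampling rate, stretching mel frames to
        # the matching number of audio samples per frame
        self.upsamplers = nn.ModuleDict({
            str(sr): nn.ConvTranspose1d(mel_channels, mel_channels,
                                        kernel_size=hop, stride=hop)
            for sr, hop in hop_lengths.items()
        })
        # stand-in for the real (PWGAN-like) generator body
        self.body = nn.Conv1d(mel_channels, 1, kernel_size=1)

    def forward(self, mel, sample_rate):
        # mel: (batch, mel_channels, frames) -> waveform: (batch, 1, samples)
        upsampled = self.upsamplers[str(sample_rate)](mel)
        return self.body(upsampled)

# At training time, a target sampling rate could be picked per batch, e.g.
# sr = random.choice([16000, 22050, 24000]); wav_hat = model(mel_batch, sr)
```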


Oh, OK, I get why the sample rate of the TTS training data matters: win_length is expressed in samples, so 1024 samples at 16 kHz does not cover the same duration as 1024 samples at 24 kHz. So I was wrong, the input sample rate does matter.
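
To spell out the arithmetic:

```python
# Duration covered by the same 1024-sample window at different sample rates
win_length = 1024
for sr in (16000, 22050, 24000):
    print(f"{sr} Hz: {1000 * win_length / sr:.1f} ms")
# 16000 Hz: 64.0 ms
# 22050 Hz: 46.4 ms
# 24000 Hz: 42.7 ms
```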

What I don’t get is why the output sample rate of the vocoder matters. Instead of focusing on a variable output sample rate, shouldn’t we focus on a variable input sample rate instead? (Since different sample rates produce different mel spectrograms.)
We could, for example, make the first few layers sr-specific by switching them during training.
Or we could train a small model to convert a 16k mel spectrogram into a 24k mel spectrogram (something like a U-net).

Am I missing something @erogol?

thanks

Edit: not a U-net, because the size of the spectrogram varies with duration, unless it is windowed.

Ok, just read your code.
I thought the upsampling was at the end of the network since you said target sampling rate. Forget what I said about U-net.
Wouldn’t it be simpler for the model to learn if the output sample rate were fixed?
Because if I understand your code correctly, it produces a waveform at the target sr each time.

It is another approach to produce a constant sr from a given input. Maybe you can try this.


Alright, right now I’m trying to convert 16k mel spectrograms into 22k ones with a model, to see whether my 16 kHz Tacotron can use a 22050 Hz WaveGlow as a vocoder without retraining.
If it doesn’t work, I’m going to try constant-sr output.

I think it’s a great start, and probably actually a better approach to solving the sr mismatch, unless the user wants the flexibility of having multiple sr options.

Looking forward to seeing your results :slight_smile:

I think it also makes sense to train this mel-spec model for multiple sampling rates, like 16k, 22.05k and 24k. I’d say it’d be hard to convert from many-to-one without explicitly providing the input sr.

Hi quick update,

Just to test something, I tried using the vocoder from the MultiBand_MelGAN_Example notebook, trained at 22050 Hz, with my French Tacotron2 at 16 kHz (50k steps).

Here’s a sample with griffin_lim:
https://soundcloud.com/julian-weber-8/gl
Here’s one where I passed the mel_spec without processing:
https://soundcloud.com/julian-weber-8/without-pitch-correction
and then corrected for pitch with librosa.effects.pitch_shift and noise reduction:
https://soundcloud.com/julian-weber-8/with-pitch-correction
And finally, the better-sounding one, where I stretched the mel-spec to the right length with Lanczos resampling:
https://soundcloud.com/julian-weber-8/16ktts-22kmelgan
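
Here’s a minimal sketch of that kind of time-axis stretching using Pillow’s Lanczos filter; the hop and sample-rate values are assumptions for illustration, not the exact settings used above:

```python
import numpy as np
from PIL import Image

def stretch_mel(mel, src_sr=16000, src_hop=256, dst_sr=22050, dst_hop=256):
    """Stretch a (num_mels, frames) mel spectrogram along time so its frame
    rate matches what the vocoder expects, using Lanczos resampling.
    The default hop/sample-rate values are assumptions for illustration."""
    src_fps = src_sr / src_hop   # frames per second produced by the TTS
    dst_fps = dst_sr / dst_hop   # frames per second expected by the vocoder
    num_mels, frames = mel.shape
    new_frames = int(round(frames * dst_fps / src_fps))
    img = Image.fromarray(mel.astype(np.float32), mode="F")
    img = img.resize((new_frames, num_mels), resample=Image.LANCZOS)  # (width=time, height=mels)
    return np.asarray(img)
```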

This is going to be my baseline for the model that I’ll build, although I’m not too sure that I can do better than that, because your vocoder was trained on a different voice/language.


Universal PWGAN looks promising after 500k iterations, but I guess it requires a larger model. I’ll let this run continue and start another one with a larger model. Using a different upsample_net for different sampling rates seems to work as well: this model can produce speech at 16 kHz, 22050 Hz and 24 kHz.


Sounds great. What is the quality you get on the smaller model? Mine could generalize, but there was a “zzzzz” sound, and when I used more speakers, it sounded muffled.

just a small background noise

Are you training using the vocoder module on the dev branch? Could you share your config? If I have time maybe I can try training too :slight_smile: Although I think it is useless seeing as you have reached 500K. Are you planning to release the model?

That’s great!
Maybe for the background noise we could try tweaking the loss a bit to encourage minimal activation in the last layer of the generator? (That might be a dumb idea, but it’s cheap to try since you don’t have to retrain the whole model.)
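
Just to illustrate what such a tweak could look like (this is purely a guess, not anything from the TTS codebase; the weight and the place where it is added are made up):

```python
# Hypothetical extra penalty on the generator's output amplitude, meant to keep
# it quiet when the conditioning is (near-)silent. `g_loss` stands for whatever
# adversarial/STFT losses are already being computed.
def generator_loss_with_activation_penalty(g_loss, fake_wav, weight=0.01):
    activation_penalty = fake_wav.abs().mean()  # discourage constant background output
    return g_loss + weight * activation_penalty
```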
I’m sorry but I had to work on something else this week and maybe the beginning of next week. I’ll keep you posted if I make any improvements.

What do you mean exactly by that?

I stopped the training since the results were not improving. The model is able to produce good speech for speakers in LibriTTS, but there is background noise again for new speakers. Next, I’ll try a larger model.