Training a universal vocoder

Oh, ok, I get why the sample rate of the TTS training data matters: win_length is expressed in samples, so a 1024-sample window at 16 kHz does not cover the same time span as at 24 kHz. So I was wrong, the input sr matters.
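Quick sanity check in plain Python, just to illustrate the mismatch:

```python
# Duration covered by a fixed 1024-sample STFT window at different sample rates.
win_length = 1024  # in samples

for sr in (16000, 22050, 24000):
    print(f"{sr} Hz -> {1000 * win_length / sr:.1f} ms per window")

# 16000 Hz -> 64.0 ms per window
# 22050 Hz -> 46.4 ms per window
# 24000 Hz -> 42.7 ms per window
```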

What I don’t get is why the output sample rate of the vocoder matters. Instead of focusing on a variable output sample rate, shouldn’t we focus on a variable input sample rate instead? (Since different sample rates produce different mel spectrograms.)
We could, for example, make the first few layers sr-specific by switching them during training.
Or we could train a small model to convert a 16k mel spectrogram into a 24k mel spectrogram (something like a U-Net).

Am I missing something, @erogol?

Thanks

Edit: not a U-Net, because the size of the spectrogram varies with the utterance duration, unless it’s windowed.

Ok, I just read your code.
I thought the upsampling was at the end of the network, since you said target sampling rate. Forget what I said about the U-Net.
Wouldn’t it be simpler for the model to learn if the output sample rate was fixed?
Because if I understand your code correctly, it produces a waveform at the target sr each time.

It is another approach to produce a constant sr from a given input. Maybe you can try this.


Alright, right now I’m trying to convert 16k mel spectrograms into 22k ones with a model, to see whether my 16 kHz Tacotron can use the 22050 Hz WaveGlow as a vocoder without retraining.
If that doesn’t work, I’m going to try a constant-sr output.
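For what it’s worth, the converter I have in mind is roughly this (just a sketch; the 80 mel bands, layer sizes and the `MelConverter` name are my own assumptions, not anything from the repo). It stretches the time axis to the 22.05 kHz frame rate first and then lets a few 1D convolutions fix up the frequency content:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelConverter(nn.Module):
    """Toy 16 kHz -> 22.05 kHz mel-spectrogram converter (hypothetical)."""
    def __init__(self, n_mels=80, hidden=256, src_sr=16000, tgt_sr=22050):
        super().__init__()
        self.scale = tgt_sr / src_sr  # time-axis stretch factor (~1.378)
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, mel):            # mel: (batch, n_mels, frames_16k)
        n_frames = int(round(mel.size(-1) * self.scale))
        mel = F.interpolate(mel, size=n_frames, mode="linear", align_corners=False)
        return self.net(mel)           # (batch, n_mels, frames_22k)
```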

I think it’s a great start and probably a better approach to solving the sr mismatch, unless the user wants the flexibility of having multiple sr options.

Looking forward to seeing your results :slight_smile:

I think it also makes sense to train this mel-spec model for multiple sampling rates, like 16k, 22.05k and 24k. I’d say it’d be hard to convert many-to-one without explicitly providing the input sr.

Hi, quick update,

Just to test something, I tried using the vocoder from the MultiBand_MelGAN_Example notebook, trained at 22050 Hz, with my 16 kHz French Tacotron2 (50k steps).

Here’s a sample with Griffin-Lim:
https://soundcloud.com/julian-weber-8/gl
Here’s one where I passed the mel spec without processing:
https://soundcloud.com/julian-weber-8/without-pitch-correction
and then corrected for pitch with librosa.effects.pitch_shift plus noise reduction:
https://soundcloud.com/julian-weber-8/with-pitch-correction
And finally, the better-sounding one, where I stretched the mel spec to the right length with Lanczos resampling:
https://soundcloud.com/julian-weber-8/16ktts-22kmelgan
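In case it’s useful, the stretching is roughly this (a sketch assuming a (n_mels, frames) array and Pillow for the Lanczos filter; `stretch_mel` is my own helper, not something from the repo):

```python
import numpy as np
from PIL import Image

def stretch_mel(mel, src_sr=16000, tgt_sr=22050):
    """Stretch a mel-spectrogram (n_mels, frames) along time with Lanczos resampling."""
    n_mels, n_frames = mel.shape
    tgt_frames = int(round(n_frames * tgt_sr / src_sr))
    img = Image.fromarray(mel.astype(np.float32), mode="F")
    # PIL sizes are (width, height) = (frames, n_mels); keep the mel axis untouched.
    img = img.resize((tgt_frames, n_mels), resample=Image.LANCZOS)
    return np.asarray(img)
```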

This is going to be my baseline for the model that I’ll build, although I’m not too sure I can do better than that, because your vocoder was trained on a different voice/language.


The universal PWGAN looks promising after 500k iterations, but I guess it requires a larger model. I’ll let this run continue and start another one with a larger model. Using a different upsample_net for different sampling rates seems to work as well. This model can produce speech at 16 kHz, 22050 Hz and 24 kHz.
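The idea is to keep a separate upsampling stack per target sample rate and pick one at run time; something along these lines (a toy sketch, not the actual PWGAN code; the hop-size factorizations below are made up):

```python
import torch
import torch.nn as nn

class MultiRateUpsampler(nn.Module):
    """Sketch: pick a different upsampling stack depending on the target sample rate."""
    def __init__(self, channels=80):
        super().__init__()
        # Hypothetical hop sizes: 200 for 16k (5*5*8), 256 for 22.05k/24k (4*8*8).
        factors = {"16000": (5, 5, 8), "22050": (4, 8, 8), "24000": (4, 8, 8)}
        self.upsamplers = nn.ModuleDict({
            sr: nn.Sequential(*[nn.Upsample(scale_factor=f, mode="nearest") for f in fs])
            for sr, fs in factors.items()
        })

    def forward(self, mel, sample_rate):   # mel: (batch, channels, frames)
        return self.upsamplers[str(sample_rate)](mel)
```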


Sounds great. What is the quality you get on the smaller model? Mine could generalize, but there was a “zzzzz” sound, and when I used more speakers, it sounded muffled.

just a small background noise

Are you training with the vocoder module on the dev branch? Could you share your config? If I have time, maybe I can try training too :slight_smile: Although I guess there isn’t much point, seeing as you have already reached 500k. Are you planning to release the model?

That’s great!
Maybe for the background noise we could try tweaking the loss a bit, to encourage minimal activation in the last layer of the generator? (That might be a dumb idea, but it’s cheap to try since you don’t have to retrain the whole model.)
I’m sorry, but I have to work on something else this week and maybe the beginning of next week. I’ll keep you posted if I make any improvements.

What do you mean by that exactly?

I stopped the training since the results were not improving. The model is able to produce good speech for speakers in LibriTTS, but there is background noise again for new speakers. Next, I’ll try a larger model.

I meant a small penalty term proportional to the intensity of each output value, to force the model to produce only the necessary sounds, but it won’t work since you say there is no noise for the training-set speakers.
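Concretely, it would amount to something like this (a sketch; the `lambda_act` weight and where exactly the penalty is applied are guesses, not from the repo):

```python
import torch

def generator_loss_with_activation_penalty(adv_loss, fake_wav, lambda_act=1e-4):
    """Sketch: add a small L1 penalty on the generated waveform (or last-layer
    activations) so the model stays silent unless it really needs to produce sound."""
    act_penalty = fake_wav.abs().mean()
    return adv_loss + lambda_act * act_penalty
```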

Maybe it’s an issue of robustness to different recording conditions and audio preprocessing. Have you tried with an unseen LibriTTS speaker?

No, I didn’t try an unseen speaker from LibriTTS. It may be better with those.

Maybe you can help me train the larger model. I can provide the config.json in that case.

Absolutely, I’d love to help! I have a free GPU late this week (Thursday or Friday) or next week, if you don’t mind waiting.

Can you create an issue on the repo for this, so we can follow the progress there? I’ll post the config there to make it available to everyone who is interested.


@erogol Could you please share the model, config and commit of the smaller model that you trained as well? If I’ve got time later this week, I’d like to try fine-tuning it with an augmented version of the dataset (artificial noise, gain/pitch changes, etc.) to see if I can make it more robust noise-wise.
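The kind of augmentation I mean is roughly this (a rough sketch with numpy/librosa; all parameter ranges are arbitrary and `augment` is just my own helper name):

```python
import numpy as np
import librosa

def augment(wav, sr):
    """Randomly add noise, change gain and shift pitch (rough sketch)."""
    out = wav.copy()
    # additive Gaussian noise at a random level
    out = out + np.random.randn(len(out)) * np.random.uniform(0.001, 0.01)
    # random gain
    out = out * np.random.uniform(0.6, 1.4)
    # random pitch shift of up to +/- 2 semitones
    out = librosa.effects.pitch_shift(out, sr=sr, n_steps=np.random.uniform(-2.0, 2.0))
    return np.clip(out, -1.0, 1.0)
```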