There are many pretrained vocoders on the original repo and some of them are on LibriTTS. They are very good quality. I wonder if anyone has used them with Mozilla TTS as is.
No, they use a different normalization method for the network inputs. But if you train a new TTS model using mean-var normalization, then you can use their models. That is possible with the dev branch.
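Mean-var normalization here just means standardizing the mel channels with statistics computed over the training set. A minimal numpy sketch of the idea (the function names are illustrative, not the repo's actual API):

```python
import numpy as np

def compute_scaler(mel_specs):
    """Per-channel mean/std over a list of (n_mels, T) spectrograms."""
    stacked = np.concatenate(mel_specs, axis=1)  # (n_mels, total_frames)
    mean = stacked.mean(axis=1, keepdims=True)
    std = stacked.std(axis=1, keepdims=True)
    return mean, std

def normalize(mel, mean, std):
    """Mean-variance normalize a single spectrogram."""
    return (mel - mean) / (std + 1e-8)

def denormalize(mel_norm, mean, std):
    """Invert the normalization, e.g. before vocoding with a model
    trained on un-normalized features."""
    return mel_norm * (std + 1e-8) + mean
```

The point is that both the TTS model and the vocoder must agree on this exact transform; a model trained on min-max-scaled features cannot consume mean-var features as-is.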
Ah, that is such a shame. Some of them are trained for 1M steps, and when I performed analysis and synthesis using the LibriTTS one, it sounded so good! I am now finetuning the universal WaveRNN you once trained, for 22050Hz instead, and I will share it if anyone is interested. That will hopefully make it easier to apply to more models. On 16kHz models it performs super nicely, but on 22kHz there is some noise. I think I should finetune more. How much more do you think I should do? I am finetuning on LibriTTS.
It is hard to guess. Just listen to the generated audio and judge the quality.
You could consider training PWGAN on LibriTTS using my branch. You could even finetune their released model for faster convergence.
Another option is to renormalize the spectrograms at inference, using their method, before feeding them to PWGAN. That might also work.
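That renormalization amounts to inverting the TTS side's normalization to recover the raw log-mel, then applying the vocoder's mean-var statistics. A hedged sketch, assuming the TTS side used a symmetric [-max_norm, max_norm] min-max scheme (both the scheme and the function names are assumptions, not the actual repo code):

```python
import numpy as np

def tts_denormalize(mel_norm, mel_min, mel_max, max_norm=4.0):
    """Invert an assumed symmetric [-max_norm, max_norm] min-max
    normalization back to the raw log-mel range."""
    return (mel_norm + max_norm) / (2 * max_norm) * (mel_max - mel_min) + mel_min

def pwgan_normalize(mel, mean, std):
    """Apply the vocoder's mean-variance statistics."""
    return (mel - mean) / std

def bridge(mel_from_tts, mel_min, mel_max, voc_mean, voc_std):
    """Renormalize a TTS-produced spectrogram for the PWGAN vocoder."""
    raw = tts_denormalize(mel_from_tts, mel_min, mel_max)
    return pwgan_normalize(raw, voc_mean, voc_std)
```

The statistics (`mel_min`, `mel_max`, `voc_mean`, `voc_std`) would have to come from each pipeline's own feature-extraction stats, so the two models never need retraining.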
I had the same idea, to finetune their model in order to make it compatible with your fork, but I think I messed it up. In short, I couldn't find the correct yaml: if I cloned the repo at the latest commit, it gave me the yaml for MelGAN, and if I checked out the PWGAN commit, it gave me the ttsv1 and ttsv2 configs but not the melgan one in bin/configs, which is needed for feature extraction and, I guess, training. Because the melgan config would train MelGAN, wouldn't it? Also, are these features shared between PWGAN and MelGAN? I got confused. Otherwise yes, I can totally try it! But I don't know what the correct workflow and configs are. Their LibriTTS vocoder is extremely good.
You can take any config from anywhere and run it with my branch. The configs are compatible.
Cool! I will try using the config that came with the model
Which commit do I use to train? fca88f9 doesn't have the config for preprocessing, and the latest one doesn't have the tts configs. The one in configs is for MelGAN.
I tried the configs from the original repo but they didn’t work.
What does "didn't work" mean? What was the error?
First I tried to extract the features using this config: https://github.com/erogol/ParallelWaveGAN/blob/tts/parallel_wavegan/configs/melgan.v3.long.tts.yaml
But that config is for MelGAN training, and I want to train PWGAN.
So I checked out fca88f9 to get the tts configs. But those configs are not like the one in the original PWGAN repo: https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/libritts/voc1/conf/parallel_wavegan.v1.long.yaml
Have you tried configs directly from the original repo?
Yep, the one here: https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/libritts/voc1/conf/parallel_wavegan.v1.long.yaml which accompanies the LibriTTS model I want to use for finetuning. But it seems that the parameters in the yaml file are arranged differently and have different formatting.
which parameter for instance?
I think the config in the original repo doesn't have the datasets section found in the tts configs, and the audio section in the tts configs has options that the other one doesn't have. I tried to add the whole section, but it didn't work. The product of the upsample scales is computed from 4 values in the original config but from 3 in your config. And the original has a section for STFT losses:
fft_sizes: [1024, 2048, 512] # List of FFT sizes for STFT-based loss.
hop_sizes: [120, 240, 50] # List of hop sizes for STFT-based loss.
win_lengths: [600, 1200, 240] # List of window lengths for STFT-based loss.
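One constraint worth checking when porting configs across forks: the product of the upsample scales must equal the hop size, otherwise the generator's output length won't line up with the waveform. A quick sanity check (the example schedules below are illustrative, not taken from either config):

```python
import math

def check_upsample_scales(upsample_scales, hop_size):
    """The generator upsamples the mel frame rate by the product of its
    scales; that product must land exactly on hop_size samples per frame."""
    product = math.prod(upsample_scales)
    if product != hop_size:
        raise ValueError(
            f"upsample_scales product {product} != hop_size {hop_size}")
    return product

# e.g. a 4-stage schedule for a 300-sample hop:
check_upsample_scales([4, 5, 3, 5], hop_size=300)
# vs. a 3-stage schedule for a 256-sample hop:
check_upsample_scales([8, 8, 4], hop_size=256)
```

This is likely why the two configs disagree on 4 values versus 3: different hop sizes factor into different upsampling schedules.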
What you can quite easily do is take the original config and change the fields as necessary for my fork.
I will try again. What config should I be using for feature extraction? The melgan one in configs?
For feature extraction the only thing that matters is the audio parameters. Just copy and paste them from the melgan config into the config you'd like to use. You could figure all this out by reading the feature extraction code.
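Copying the audio section over can be scripted rather than done by hand. A sketch with PyYAML; the key names here are examples, and the real list is whatever fields the feature extraction code actually reads:

```python
import yaml

# Audio-related keys to carry over; adjust to match what the feature
# extraction code reads (these names are illustrative).
AUDIO_KEYS = ["sampling_rate", "fft_size", "hop_size", "win_length",
              "window", "num_mels", "fmin", "fmax"]

def merge_audio_params(src_cfg_path, dst_cfg_path, out_path):
    """Copy audio parameters from one YAML config into another and
    write the merged result."""
    with open(src_cfg_path) as f:
        src = yaml.safe_load(f)
    with open(dst_cfg_path) as f:
        dst = yaml.safe_load(f)
    for key in AUDIO_KEYS:
        if key in src:
            dst[key] = src[key]
    with open(out_path, "w") as f:
        yaml.safe_dump(dst, f)
    return dst
```

Scripting it also makes it easy to diff the merged config against the original and spot any field the two forks name differently.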
I was able to get it to work after all. I took the PWGAN+TTS notebook here: https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR and @edresson1's fork, since that is the one I work with, and I changed the synthesize function around. I was able to load both a vocoder trained with your fork and the vocoders from the original repo, using a franken'd config.
Sadly, the quality is not good. I think it is because of the normalization issue you mentioned. The voice comes out hollow and distorted, both with the original repo vocoders and with the one I tried to finetune, even though after 30,000 steps (finetuning on LibriTTS) it was producing good speech during eval. Now I will try to finetune the LJSpeech vocoder you trained for 40,000 steps and see if I get better results.
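A quick way to confirm a normalization mismatch is to inspect the statistics of the features each pipeline hands the vocoder: mean-var-normalized features should sit near zero mean and unit variance, while min-max-style features will not. A rough diagnostic sketch (the threshold is an arbitrary assumption):

```python
import numpy as np

def describe_features(mel):
    """Summarize a (n_mels, T) feature matrix fed to the vocoder."""
    return {"mean": float(mel.mean()),
            "std": float(mel.std()),
            "min": float(mel.min()),
            "max": float(mel.max())}

def looks_mean_var_normalized(mel, tol=0.5):
    """Heuristic: a mean-var pipeline should yield roughly (0, 1) stats;
    anything far off hints the TTS and vocoder disagree on normalization."""
    stats = describe_features(mel)
    return abs(stats["mean"]) < tol and abs(stats["std"] - 1.0) < tol
```

Running this on the spectrograms produced by the TTS model versus the features the vocoder was trained on would show whether hollow, distorted output is really coming from mismatched statistics.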