So I trained Mozilla TTS with Tacotron2 on a custom dataset. The Griffin-Lim previews are starting to sound really good, although a bit robotic.
I now want to move on to the ParallelWaveGAN vocoder.
How do I go about doing that? Is there a notebook or a command to run it?
What are the prerequisites?
Do I need to train the ParallelWaveGAN model at all, like I did with the TTS model?
I just want to understand the full process before I start refining my original dataset and possibly retraining my TTS model with a larger set and better-quality audio.
You need to run the "extract features" notebook, then start the PwGAN training. Make sure to use the same audio-processing settings (sample rate, trim silence, normalize, fmin, fmax, etc.) as you used for the Taco2 training.
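For orientation, those settings live in the "audio" section of the config. Below is a minimal sketch of the block that has to stay identical between the two configs; the values are just illustrative defaults, and the exact key names can differ between TTS versions, so copy them from your own Tacotron2 config rather than from here.

```python
# Sketch only: audio-processing settings that must match between the Tacotron2
# config and the vocoder config. Values are illustrative defaults; key names
# may differ between TTS versions, so use your own config.json as the reference.
shared_audio_settings = {
    "sample_rate": 22050,      # sampling rate of your training wavs
    "num_mels": 80,            # number of mel channels the vocoder will consume
    "fft_size": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
    "do_trim_silence": True,   # silence trimming
    "signal_norm": True,       # spectrogram normalization
}
```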
Thanks, but I couldn't find this "extract features" notebook. In my TTS installation there are a couple of notebooks, and one is called "ExtractTTSpectrogram.ipynb".
However, its description says "This is a notebook to generate mel-spectrograms from a TTS model to be used for WaveRNN training."
Sorry, I mixed this up. For the PwGAN you need to run a preprocess script (which basically does the same as the extract-features notebook for WaveRNN).
You can also train on ground truth: just set "data_path" in the config to the folder where all your wavs are.
Set the audio parameters equal to your TTS settings (see the sketch below).
You can also check the https://github.com/erogol/TTS_recipes repository, which contains full recipes for training TTS models and vocoders; you can try adjusting them to your needs.
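If it helps, here is a rough sketch in Python of the ground-truth setup, assuming both configs are plain JSON files; the paths are placeholders and the key names may differ between versions.

```python
# Hedged sketch: prepare a vocoder config for ground-truth training by copying
# the audio settings from the Tacotron2 config and pointing "data_path" at the
# folder with the training wavs. Paths and key names are assumptions.
import json

with open("tacotron2_config.json") as f:        # your TTS training config
    taco_cfg = json.load(f)
with open("vocoder_config.json") as f:          # vocoder (e.g. PwGAN) config template
    voc_cfg = json.load(f)

voc_cfg["audio"] = taco_cfg["audio"]            # identical audio processing
voc_cfg["data_path"] = "/path/to/dataset/wavs"  # placeholder: your wav folder

with open("vocoder_config_patched.json", "w") as f:
    json.dump(voc_cfg, f, indent=4)
```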
Honestly, I don't care which vocoder I use. I just want the best-sounding one. I have no need for real-time TTS, so I just want to render sentences to audio "offline", so to speak. Do you have any recommendations based on that?
The TTS model is trained at this point, so now I just want it to sound better.
Where is the vocoder model saved, then? I configured my TTS model to be saved directly to Drive, and it would be cool if I could do the same for the vocoder.
So the vocoder and the TTS model are completely separate, although trained on the same dataset?
When I have later trained the vocoder on the same dataset, the two are combined/working together to synthesize the sentence? Is this a correct description?
At what point will I be able to test these models together? Is there a notebook available that lets me point to my TTS and vocoder models to generate audio?
In a very generalized way: TTS models are trained on audio+text input and output mel-spectrograms. Vocoder models are trained on mel-spec input and output audio.
(The Griffin-Lim vocoder is a special case: it reconstructs audio directly from mel-specs without a trained model, but the quality is inferior to neural vocoder models.)
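To make the Griffin-Lim case concrete, here is a small self-contained sketch using librosa (not the TTS internals, and not the exact parameters TTS uses): it computes a mel-spectrogram from a wav and inverts it back to audio with Griffin-Lim, i.e. with no trained model involved. A neural vocoder replaces exactly this inversion step, which is where the quality gain comes from.

```python
# Illustration with librosa (an assumption: this is not the TTS code itself,
# just the general idea): mel-spectrogram -> audio via Griffin-Lim, no model.
import librosa
import soundfile as sf

# Placeholder path: any wav from your dataset
y, sr = librosa.load("sample.wav", sr=22050)

# Forward: audio -> mel-spectrogram (what a TTS model learns to predict from text)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Inverse: mel-spectrogram -> audio, with phase estimated by Griffin-Lim
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)

sf.write("griffin_lim_reconstruction.wav", y_hat, sr)
```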
synthesize.py in the dev branch has a vocoder-path option where you can specify the vocoder model. I believe there is also a notebook…
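For reference, a call might look roughly like the sketch below; the script location, argument order, and flag names are assumptions and may differ in your checkout, so verify them with synthesize.py's --help before relying on this.

```python
# Hedged sketch: running synthesize.py with both a trained TTS checkpoint and a
# trained vocoder checkpoint. Script path, argument order, and flag names are
# assumptions; verify them with "--help" in your own branch.
import subprocess

subprocess.run([
    "python", "TTS/bin/synthesize.py",              # location may differ per branch
    "This is a test sentence.",                     # text to synthesize
    "tts_config.json",                              # Tacotron2 config (placeholder)
    "tts_checkpoint.pth.tar",                       # trained Tacotron2 checkpoint
    "output/",                                      # folder where the wav is written
    "--vocoder_path", "vocoder_checkpoint.pth.tar",
    "--vocoder_config_path", "vocoder_config.json",
], check=True)
```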