Tacotron 2 with ParallelWaveGAN. Next step


So I trained Mozilla TTS with Tacotron 2 using a custom dataset. The Griffin-Lim previews are starting to sound really good, although robotic.

I now want to move on to use the ParallelWaveGAN vocoder.

  1. How do I go about doing that? Is there a notebook or syntax to run it?
  2. What are the prerequisites?
  3. Do I need to train the ParallelWaveGAN model at all, like I did with the TTS model?

I just want to understand the full process before starting to refine my original dataset and possibly retrain my TTS model with a larger set and better-quality audio.

I am stuck at the TTS training stage right now.

Any pointers would be greatly appreciated.

You need to run the "extract features" notebook, then start the PwGAN training. Make sure to use the same settings for audio processing (sample rate, trim silence, normalize, fmin, fmax, etc.) as you have set for the Taco2 training.
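For reference, the block that has to match usually sits under an "audio" key in both config files. The field names below follow the dev-branch configs and may differ slightly between branches, and the values are only placeholders, so copy the actual numbers from your Taco2 config:

```json
"audio": {
    "sample_rate": 22050,
    "fft_size": 1024,
    "win_length": 1024,
    "hop_length": 256,
    "num_mels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
    "do_trim_silence": true,
    "signal_norm": true
}
```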

Thanks, but I couldn't find this "extract features" notebook. In my TTS installation there are a couple of them, and one is called "ExtractTTSpectrogram.ipynb".

However, the description says "This is a notebook to generate mel-spectrograms from a TTS model to be used for WaveRNN training."

Sorry, I mixed this up. For the PwGAN you need to run a preprocess script (which basically does the same as the extract-features notebook for WaveRNN).

Which PwGAN implementation are you going to use?

What branch are you using?

You can also train on ground truth; just set "data_path" in the config to the path where all your wavs are.
Set the audio parameters equal to your TTS settings.

Start the training as stated here: https://github.com/mozilla/TTS/tree/dev/mozilla_voice_tts/vocoder

You can also check this https://github.com/erogol/TTS_recipes repository, which contains full recipes for training TTS and Vocoders. You can try adjusting them to your needs.


I am using the Mozilla TTS dev branch

Sorry for being a stupid noob. I am not sure what you mean by that.

I am using Colab for the TTS and was planning to use this

Honestly, I don't care which vocoder I use. I just want the best-sounding one. I have no need for real-time TTS, so I just want to render sentences to audio "offline", so to speak. Do you have any recommendations based on that?

The TTS model is trained at this point, so now i just want it to sound better.

This is the old implementation.

The current dev branch has native vocoder support: https://github.com/mozilla/TTS/tree/dev/mozilla_voice_tts/vocoder

Under the configs folder, check parallel_wavegan_config.json.

Follow my steps from above:
1: Set audio parameters
2: Set path to your wav files
3: Start training
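For the second step, the line in parallel_wavegan_config.json looks like this (the path below is only an illustrative Colab/Drive location, not a required layout):

```json
"data_path": "/content/drive/My Drive/mydataset/wavs/"
```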

Ok, so no extraction step needed?

I was hoping to use a pretrained vocoder model to reduce training time. So I need to start from scratch based on my trained TTS model?

I am cloning this into my colab notebook at the moment
!git clone -b dev https://github.com/mozilla/TTS

Using CUDA: True
Number of GPUs: 1
usage: train_vocoder.py [-h] --continue_path CONTINUE_PATH
[--restore_path RESTORE_PATH] --config_path
CONFIG_PATH [--debug DEBUG] [--rank RANK]
[--group_id GROUP_ID]
train_vocoder.py: error: the following arguments are required: --continue_path, --config_path

Are these paths to the TTS model and TTS config file or the Vocoder ones?

Thanks for your help!!

I tried this

!python /content/TTS/mozilla_voice_tts/bin/train_vocoder.py --config_path "/content/TTS/mozilla_voice_tts/vocoder/configs/parallel_wavegan_config.json"

It seems to be working.

  1. Where is the vocoder model saved, then? I configured my TTS model to be saved directly to Drive, and it would be cool if I could do the same for the vocoder.

Yes, it also works without extraction. There is also an option to train on extracted features, but I haven't tested it yet.

Those are paths to the vocoder, in case you want to fine-tune or continue training later on.

In the config you can set the output_path, somewhere at the bottom.
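Since you are on Colab with Drive, and assuming Drive is already mounted in the session (via `drive.mount` from `google.colab`), a line like this in parallel_wavegan_config.json will write checkpoints straight to Drive (the path is illustrative):

```json
"output_path": "/content/drive/My Drive/vocoder_models/"
```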

Thanks man, I REALLY appreciate it!

  1. So the vocoder and the TTS model are completely separate, although trained on the same dataset?

  2. When I have later trained the vocoder on the same dataset, the two are combined/work together to synthesize a sentence? Is this a correct description?

  3. At what point will I be able to test these models together? Is there a notebook available that allows me to point to my TTS and vocoder models to generate audio?

  1. Yes.

  2. In a very generalized way: TTS models are trained on audio+text input and output mel-spectrograms. Vocoder models are trained on mel-spec input and output audio.
    (Griffin-Lim is a special case: it reconstructs audio directly from mel-specs without a model, but the quality is inferior compared to vocoder models.)

  3. synthesize.py in dev-branch has vocoder-path option where you can specify vocoder model. I believe there is also a notebook…
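A sketch of what that call can look like (argument names and order differ between branches, so check `python .../synthesize.py --help` first; every path below is a placeholder, not a real location):

```shell
python /content/TTS/mozilla_voice_tts/bin/synthesize.py \
    "Text to be spoken." \
    /path/to/tts_config.json \
    /path/to/tts_model.pth.tar \
    /path/to/output_folder/ \
    --vocoder_path /path/to/vocoder_model.pth.tar \
    --vocoder_config_path /path/to/parallel_wavegan_config.json
```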

Thanks a lot! This really helps me to learn

But you NEED to train the TTS model before training the vocoder, right? Can't have one without the other?

You can train the vocoder first, but it is of no use standalone as it needs mel-specs from the TTS as input for inference (synthesis).
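To make the hand-off concrete, here is a toy sketch in plain NumPy (nothing here comes from Mozilla TTS; the numbers are common defaults, not your settings) of the shapes involved: the TTS side produces a mel-spectrogram matrix, and the vocoder side expands each frame back into hop_length audio samples:

```python
import numpy as np

# Toy stand-ins: the real models are neural networks; these arrays only
# illustrate the data passed between the TTS model and the vocoder.
num_mels = 80      # mel channels, a common default
n_frames = 200     # spectrogram frames for one utterance
hop_length = 256   # audio samples represented by each frame

# What a TTS model hands to the vocoder: shape [num_mels, n_frames]
mel_spec = np.zeros((num_mels, n_frames))

# What a vocoder hands back: roughly n_frames * hop_length samples
audio_len = mel_spec.shape[1] * hop_length
print(audio_len)           # 51200 samples
print(audio_len / 22050)   # ~2.32 seconds at a 22050 Hz sample rate
```

This is also why the audio parameters must match between the two configs: the vocoder only knows how to undo the exact mel transform the TTS model was trained with.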

Wouldn't it make sense to use a pretrained vocoder model instead of starting from scratch like I did?