Tacotron 2 with ParallelWaveGAN. Next step

Hi

So I trained Mozilla TTS with Tacotron 2 using a custom dataset. The Griffin-Lim previews are starting to sound really good, although robotic.

I now want to move on to use the ParallelWaveGAN vocoder.

  1. How do I go about doing that? Is there a notebook or syntax to run it?
  2. What are the prerequisites?
  3. Do I need to train the ParallelWaveGAN model at all, like I did with the TTS model?

I just want to understand the full process before starting to refine my original dataset and possibly retrain my TTS model with a larger set and better-quality audio.

I am stuck at the TTS training stage right now.

Any pointers would be greatly appreciated.


You need to run the "extract features" notebook, then start the PwGAN training. Make sure to use the same settings for audio processing (sample rate, trim silence, normalize, fmin, fmax, etc.) as you have set for the Taco2 training.
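To illustrate, the "audio" block in the vocoder config should end up mirroring the one in your Taco2 config. A minimal sketch, with placeholder values (key names can differ between branches, so copy the real block from your own TTS config rather than typing it from memory):

"audio": {
    "sample_rate": 22050,
    "num_mels": 80,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
    "do_trim_silence": true,
    "signal_norm": true
}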

Thanks, but I couldn't find this "extract features" notebook. In my TTS installation there are a couple of them, and one is called "ExtractTTSpectrogram.ipynb".

However, the description says "This is a notebook to generate mel-spectrograms from a TTS model to be used for WaveRNN training."

Sorry, I mixed this up. For the PwGAN you need to run a preprocess script (which basically does the same as the extract-features notebook for WaveRNN).

Which PwGAN implementation are you going to use?

What branch are you using?

You can also train on ground truth; just set "data_path" in the config to the path where all your wavs are.
Set the audio parameters equal to your TTS settings.
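As a sketch, the relevant entry in the vocoder config looks something like this (the path below is only an example; point it at your own wav folder):

"data_path": "/content/drive/My Drive/my_dataset/wavs/",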

Start the training as stated here: https://github.com/mozilla/TTS/tree/dev/mozilla_voice_tts/vocoder

You can also check this https://github.com/erogol/TTS_recipes repository, which contains full recipes for training TTS and Vocoders. You can try adjusting them to your needs.

Hi

I am using the Mozilla TTS dev branch

Sorry for being a stupid noob. I am not sure what you mean by that.

I am using Colab for the TTS and was planning to use this

Honestly, I don't care which vocoder I use. I just want the best-sounding one. I have no need for real-time TTS, so I just want to render sentences to audio "offline", so to speak. Do you have any recommendations based on that?

The TTS model is trained at this point, so now i just want it to sound better.

This is the old implementation.

The current dev branch has native vocoder support: https://github.com/mozilla/TTS/tree/dev/mozilla_voice_tts/vocoder

Under the configs folder check the parallel_wavegan_config.json.

Follow my steps from above:
1: Set audio parameters
2: Set path to your wav files
3: Start training

Ok, so no extraction step needed?

I was hoping to use a pretrained vocoder model to reduce training time. So I need to start from scratch based on my trained TTS model?

I am cloning this into my Colab notebook at the moment:
!git clone -b dev https://github.com/mozilla/TTS
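After the clone, the dependencies also need to be installed in the Colab runtime; something along these lines should work (exact steps may differ between commits, so check the repo README):

!pip install -r /content/TTS/requirements.txt
!pip install -e /content/TTS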

Using CUDA: True
Number of GPUs: 1
usage: train_vocoder.py [-h] --continue_path CONTINUE_PATH
[--restore_path RESTORE_PATH] --config_path
CONFIG_PATH [--debug DEBUG] [--rank RANK]
[--group_id GROUP_ID]
train_vocoder.py: error: the following arguments are required: --continue_path, --config_path

Are these paths to the TTS model and TTS config file or the Vocoder ones?

Thanks for your help!!

I tried this

!python /content/TTS/mozilla_voice_tts/bin/train_vocoder.py --config_path "/content/TTS/mozilla_voice_tts/vocoder/configs/parallel_wavegan_config.json"

It seems to be working.

  1. Where is the vocoder model saved, then? I configured my TTS model to be saved directly to Drive, and it would be cool if I could do the same for the vocoder.

Yes, it also works without extraction. There is also an option to train on extracted features, but I haven't tested it yet.

Those are the paths for the vocoder, in case you want to fine-tune or continue training later on.
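For example, to resume an interrupted run you would point --continue_path at the output folder of that run (placeholder path below; depending on the commit you may also need --config_path pointing at the config saved inside that folder):

!python /content/TTS/mozilla_voice_tts/bin/train_vocoder.py --continue_path "/path/to/your/previous/run_folder"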

In the config you can set the output_path, somewhere at the bottom.
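If you want the checkpoints on Drive like your TTS run, one way (just a sketch, the paths are examples) is to mount Drive in the notebook first:

from google.colab import drive
drive.mount('/content/drive')

and then point the config at it:

"output_path": "/content/drive/My Drive/vocoder_output/",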

Thanks man, I REALLY appreciate it!

  1. So the vocoder and the TTS model are completely separate, although trained on the same dataset?

  2. When I have later trained the vocoder on the same dataset, do the two then work together to synthesize a sentence? Is that a correct description?

  3. At what point will I be able to test these models together? Is there a notebook available that allows me to point to my TTS and vocoder models to generate audio?

  1. yes.

  2. In a very generalized way: TTS models are trained on audio + text input and output mel-spectrograms. Vocoder models are trained on mel-spectrogram input and output audio.
    (The Griffin-Lim vocoder is a special case: it reconstructs audio directly from mel-specs without a trained model, but the quality is inferior to that of vocoder models.)

  3. synthesize.py in the dev branch has a vocoder-path option where you can specify the vocoder model. I believe there is also a notebook…
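As a rough sketch of how the two models are combined at synthesis time (argument names here are from memory and may have changed, so check python synthesize.py --help; the paths are only placeholders):

!python /content/TTS/mozilla_voice_tts/bin/synthesize.py "Text to speak." /path/to/tts_config.json /path/to/tts_model.pth.tar /content/output/ --vocoder_path /path/to/vocoder_model.pth.tar --vocoder_config_path /path/to/vocoder_config.json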

Thanks a lot! This really helps me learn.

But you NEED to train the TTS model before training the vocoder, right? Can't have one without the other?

You can train the vocoder first, but it is of no use standalone, as it needs mel-specs from the TTS model as input for inference (synthesis).

Wouldn't it make sense to use a pretrained vocoder model instead of starting from scratch like I did?