Parallel WaveGAN in Dev

Hi @erogol - in Dev I saw mention of Parallel WaveGAN.

Q. Is it best to wait for it to reach master, or is it okay to use now (bearing in mind of course it’s dev, thus prone to change, caveat emptor etc. :slightly_smiling_face:)?

I am tempted to give it a go, but haven’t trained using WaveRNN before, so would need to read up on what’s needed and I’m hoping Parallel WaveGAN needs a fairly similar setup (?)

Unless you advise otherwise, I was thinking of trying Tacotron 1 + Parallel WaveGAN.

There is that Colab on WaveRNN, so I’m hoping that with some common sense and care I’ll figure it out by adapting from that.

Any tips or advice I should bear in mind?

You can use PWGAN in pretty much the same way, but it is not documented yet (I am lagging behind on documentation). You need to find your way around for the time being, but if you have any questions, let me know here.

Basic workflow (rough command sketch below):

1. use preprocess_tts.py to generate mel files
2. set tts_config.yaml (mostly the paths)
3. run bin/train.py
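Roughly something like this; the exact argument names here are an assumption on my part, so check `--help` on the dev branch:

```bash
# rough sketch of the three steps above -- argument names are assumptions, check --help on the dev branch
python preprocess_tts.py                        # 1. generate mel files from your wavs
# 2. edit tts_config.yaml (mostly the dataset / output paths)
python bin/train.py --config tts_config.yaml    # 3. start PWGAN training
```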


Great, I’ll have a go in the next few days then. If I figure it out, I’d be happy to suggest some details for the docs as a PR.


Brief update: I’ve given it a go. The preprocess_tts.py step seemed okay (minor confusion: I thought the data_path parameter was set in the config, so the script sat doing nothing until I figured that out).
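For anyone following along, what unblocked it for me was pointing the script at the wavs directly on the command line rather than via the config (the exact flag name here is my assumption):

```bash
# assumption: preprocess_tts.py takes the wav location as a command-line argument,
# since data_path is not picked up from tts_config.yaml
python preprocess_tts.py --data_path /path/to/LJSpeech-1.1/wavs
```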

With train.py I immediately hit the memory limitation of my 1080 Ti (11 GB vs. the 12 GB you mention is needed in tts_config.yaml).

I’ve therefore given it a go by reducing the batch size, and that seems to have worked (so far). I was unsure which hop size applied to the batch-size setting, as it’s mentioned in several places in the config. The terminal output listed the hop size as 275, so I went with a multiple of that, and it seems to be working (although I’ve had to leave it running now due to commitments). I set the batch size to 24,200 (a multiple of 275, scaled down to a little more than 11/12ths of the original, in the hope that it would then fit on the 1080 Ti).
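In case it helps anyone else on an 11 GB card, the tweak was roughly this (the key names are my assumption, based on the upstream ParallelWaveGAN-style config; yours may differ):

```yaml
# sketch of the memory tweak -- key names are assumptions, check tts_config.yaml
hop_size: 275            # as reported in the terminal output at startup
batch_max_steps: 24200   # 88 * 275; reduced from the default so a batch fits in 11 GB
```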

I’ll be back at the computer tomorrow morning so will see how it has got on.

Currently I’m training on LJ Speech, simply to ensure I can recreate things with minimal changes; then I’ll look at trying my own dataset once I’ve a better grasp of how it works.

Update 2: Looks like the batch size change has been okay - at least insofar as it hasn’t run out of memory since making that change.

It’s at roughly 124k iterations now, having run for ~29 hrs. It appears to be going okay, and I just had a look at progress in TensorBoard. The various loss curves seem to have flattened out now - is this normal, and should I consider stopping training?

I’m fine to leave it going longer, but right now it’s hard to tell whether they are properly flat or might still be going down ever so slightly. The spectral convergence loss does look like it could still be improving, and likewise perhaps the STFT magnitude loss.

Looking in the predictions folder, the audio samples are short, but listening to them they do seem pretty close to the reference source files. That’s backed up by the waveform comparison images, which are visually quite close but show definite, if subtle, differences in the finer-detail sections.


Actually, using the cursor to check the exact figures, the spectral convergence loss is definitely still going down subtly, and likewise the others; it’s just very slow, so I’ll keep it going.

Incidentally, I found that the batch-size adjustment I used for the memory issue was (independently) what the author of the original repo recommended here:

It’s still training but should finish by tomorrow evening or so (maybe earlier but I’ll be at work!)

Thinking ahead, on a practical note: once trained, how do I then use it?

With WaveRNN there’s code within synthesize.py that calls WaveRNN’s generate method (here: https://github.com/mozilla/TTS/blob/8af75cad46d1067e1aeff3ebf46a9b2d0a5f4f98/synthesize.py#L37 ), but the organisation of ParallelWaveGAN seems somewhat different - is there a simple approach I’m missing, or would I need to adapt it? I’m guessing that looking at decode.py might be the place to start.

I updated the TTS branch and added an inference function for the model, with optional input folding. You just need to take the spectrogram output and give it to model.inference(), similar to the benchmark notebook.
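Something along these lines should work (a minimal sketch, not the exact dev-branch API; model construction/loading and tensor layout are assumptions, so compare against the benchmark notebook):

```python
import torch

# `vocoder` is the trained PWGAN model instance, built and loaded as in the benchmark notebook;
# `mel` is the TTS model's mel-spectrogram output as a numpy array of shape [T, num_mels].
# Shapes and normalization here are assumptions -- check the notebook for the exact layout.
def pwgan_vocode(vocoder, mel):
    """Run the PWGAN vocoder on a mel spectrogram via the new model.inference() entry point."""
    vocoder.eval()
    with torch.no_grad():
        mel_t = torch.from_numpy(mel).float().unsqueeze(0)  # add a batch dimension
        wav = vocoder.inference(mel_t)                       # optional input folding happens inside
    return wav.squeeze().cpu().numpy()
```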

I also get better results with PWGAN after enabling silence trimming, which I had forgotten before.
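If you want to replicate that, the switch should be something like this in the preprocessing/audio section of the config (the key name is an assumption, based on the upstream config style):

```yaml
# assumption: key name follows the upstream ParallelWaveGAN-style config
trim_silence: true   # trim leading/trailing silence from the wavs during preprocessing
```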


Any comparison between WaveRNN and PWGAN?

Also, any plans on trying to get MelGAN up?

WaveRNN sounds better so far.

No plans for MelGAN yet.

I’ll give a more detailed update at the weekend, but in summary: I did manage to get things hooked up, feeding in the spectrograms, but I suspect I may have to debug my code a bit in case I messed up somewhere, as the quality of the output with PWGAN for me is terrible! It sounds more like a small swarm of bees :slightly_smiling_face:

This is despite what sounded like good results from training PWGAN for the full 400k iterations over ~four days.

I took the 260k-iteration Tacotron 2 pretrained model and it works well with Griffin-Lim, but with PWGAN I’m getting no distinct words and, like I say, a “bee swarm” effect.

Now I am trying MelGAN. PWGAN has a constant background noise for some reason. Maybe MelGAN could work better.

Thanks for the update @erogol. I was planning on trying PWGAN with a WaveRNN generator instead of the WaveNet one.

Please do let us know if you could use any help.