Example of healthy loss curves in Tacotron2 + Multiband MelGAN?

Hi,

I am testing the approach with Tacotron2 + Multiband MelGAN in the dev branch (commit 8c07ae7).
I was wondering if anyone has an example of what healthy loss curves look like when training the vocoder.

In my case (with a smallish custom dataset), I observed some weird behavior the moment the discriminator is introduced (steps_to_start_discriminator).
Before that point the generator losses were trending down (as expected). After it, they more or less blew up.

Is this expected? Is it just a matter of training for longer? Or maybe setting a larger steps_to_start_discriminator?

Thanks!

Recently there were some discussions here on discriminator training. To my understanding, there are two options you can try: a) start discriminator training right away from the beginning, or b) let it kick in much later, e.g. at step 500k (the default is IIRC 200k). The sketch below shows roughly how this knob plays into the generator loss.
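A minimal sketch in Python, assuming a simplified MelGAN-style training loop: steps_to_start_discriminator is the real config key, but the function names and the 2.5 adversarial weight (taken from the Multi-Band MelGAN paper) are illustrative, not the actual Mozilla TTS trainer code.

```python
# Simplified two-phase generator loss, as used in MelGAN-style vocoder
# training (illustrative sketch, not the actual Mozilla TTS trainer).

config = {
    # option a) 0 = train the discriminator from the very beginning
    # option b) a large value, e.g. 500_000, delays it
    "steps_to_start_discriminator": 500_000,
}

def generator_loss(step, stft_loss, adv_loss, lambda_adv=2.5):
    """Total generator loss at a given training step."""
    if step < config["steps_to_start_discriminator"]:
        # phase 1: only the multi-resolution STFT loss trains the
        # generator, so the curve trends smoothly downwards
        return stft_loss
    # phase 2: the adversarial term is added on top, so the reported
    # generator loss jumps the moment the discriminator starts; the
    # "blow up" is expected and not necessarily divergence
    return stft_loss + lambda_adv * adv_loss
```

Because the adversarial term isn't comparable across the two phases, the spectral (STFT) loss component is usually the more meaningful curve to watch after the switch.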

Thanks Dominik. I will try both approaches and see what works best on my dataset.

I haven’t trained often enough to get a good sense of what works best on my dataset or overall. I did have a number of false starts, and then through perseverance ended up with a single quite good result, but it may well not be the optimal approach.

My good run was done with the discriminator kicking in at 600k (following a point @erogol had made on his blog). When it kicked in, there was a long period where the audio you hear in TensorBoard definitely sounded worse, but in time you could very gradually start to hear that it was actually getting better than it had been earlier on (before the discriminator kicked in).

Very interesting @nmstoker.
Maybe I just need to let it train for longer.

In any case, could you share a screenshot of the losses from your good run? It would be useful to have an idea of the magnitude of the losses I should aim for in order to get good results.

Hi Rui,

Here you go. I did have some computer memory issues (since resolved in the dev branch) during this time, so for much of it I was stopping and restarting training, which tends to disrupt the charts for a bit.

I am wary of saying “just train longer”, because training takes a long time and I’d feel bad if you trained for ages and still got poor results (i.e. wasted your time). That said, for me the quality improvement started being most noticeable around 800-900k steps: it began to eclipse my best prior model without a vocoder around then, and went on to exceed that model as it went from 900k to 1m. There was a gradual, noticeable improvement before that 800k point too, but it was a little odd: in some ways it still sounded worse than it had just before the discriminator kicked in, yet a more human sound was creeping in even though the quality had been set back. Anyway, when I get around to doing another run, I hope to gain more of a sense of whether this was “normal” or not. Maybe if others chip in we’ll get a picture more quickly too.

Hope it goes well, and I’d be interested to hear updates.

Kind regards,
Neil

BTW: here’s a demo of the results using the vocoder from the 1m point (at that time I hadn’t completed the 1m-1.2m run that’s the pink line above; to be honest, I’m in two minds as to whether it improved much during that final phase up to 1.2m).
