RAdam optimizer

I see that the dev branch is now using the RAdam optimizer.

FYI, there is a newer, even better optimizer called Ranger: https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d

It combines RAdam with Lookahead, a new idea from a paper out of Hinton's group. I don't have personal experience with it, but the paper looks very reasonable.
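
In case it helps the discussion, here is a rough sketch of what Lookahead does around a base optimizer. This is not the exact Ranger code from the post; it's a minimal PyTorch illustration, with `k` and `alpha` set to the paper's defaults and `torch.optim.Adam` standing in for RAdam if no RAdam implementation is at hand:

```python
import torch


class Lookahead:
    """Minimal sketch of Lookahead wrapping any base optimizer (e.g. RAdam)."""

    def __init__(self, base_optimizer, k=5, alpha=0.5):
        self.base, self.k, self.alpha = base_optimizer, k, alpha
        self.counter = 0
        # The extra memory cost: one "slow" copy of every trainable weight.
        self.slow_weights = [
            [p.detach().clone() for p in group["params"]]
            for group in base_optimizer.param_groups
        ]

    def zero_grad(self):
        self.base.zero_grad()

    def step(self):
        self.base.step()                  # normal "fast" update (RAdam/Adam/...)
        self.counter += 1
        if self.counter % self.k == 0:    # every k steps, sync slow and fast weights
            for group, slows in zip(self.base.param_groups, self.slow_weights):
                for p, slow in zip(group["params"], slows):
                    slow += self.alpha * (p.detach() - slow)  # slow <- slow + alpha * (fast - slow)
                    p.data.copy_(slow)                        # reset fast weights to the slow copy


# Usage sketch (model is any nn.Module):
# optimizer = Lookahead(torch.optim.Adam(model.parameters(), lr=1e-3), k=5, alpha=0.5)
```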

Thanks for pointing this out. However, it needs 2 copies of the model weights, which would use more GPU memory. I guess it is better for now to stick with RAdam.

Isn't the memory for keeping the weights negligible? Tacotron has about 7M parameters, Taco2 has about 20M. Stored as 4-byte floats, that's about 30MB and 80MB respectively. For any decent GPU that's negligible. It's also small compared to the memory needed for the training data.
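
Quick back-of-the-envelope check of those numbers, assuming float32 weights and the parameter counts above:

```python
# 4 bytes per float32 parameter; one extra copy of the weights for Lookahead.
for name, params in [("Tacotron", 7e6), ("Tacotron2", 20e6)]:
    extra_copy_mb = params * 4 / 1024 ** 2
    print(f"{name}: ~{extra_copy_mb:.0f} MB for one extra copy of the weights")
# Tacotron: ~27 MB, Tacotron2: ~76 MB
```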

You don't only keep the weights. You also run forward and backward passes on them. I'd guess it is almost 2x more memory, but I might be wrong. (I just skimmed the post.)
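
One way to settle it empirically would be to compare peak GPU memory over a few training steps with and without the wrapper. A sketch, assuming CUDA, a real training batch and loss, and the `Lookahead` class from the earlier snippet:

```python
import torch


def peak_memory_mb(model, optimizer, batch, steps=5):
    """Run a few optimizer steps and report peak allocated GPU memory in MB."""
    torch.cuda.reset_peak_memory_stats()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = model(batch).mean()  # placeholder loss; use the real training loss here
        loss.backward()
        optimizer.step()
    return torch.cuda.max_memory_allocated() / 1024 ** 2
```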