Apply Learning Rate Decay to the Adam Optimizer

Hello,

While I wait to train some modified DeepSpeech code on a GPU, I wanted to ask whether anyone has already implemented learning rate decay for the Adam optimizer before I begin training. Is there a reason you wouldn’t want to do this? My code block is below. Decay would likely shift the best starting point to a much higher learning rate, but it might also help me avoid triggering early stopping inappropriately because I didn’t pick a good initial learning rate.

import csv

import tensorflow as tf

# FLAGS and Config come from DeepSpeech's training utilities.

def create_optimizer():
    # Count the training samples to estimate how many steps make up one epoch.
    with open(FLAGS.train_files, "r") as f:
        reader = csv.reader(f, delimiter=",")
        data = list(reader)
        row_count = len(data)

    # Heuristic: decay once every FLAGS.es_steps epochs' worth of global steps.
    decay_steps = int(row_count / Config.available_devices / FLAGS.train_batch_size) * FLAGS.es_steps
    global_step = tf.Variable(0, trainable=False)
    starter_learning_rate = FLAGS.learning_rate

    # Multiply the learning rate by 0.92 each time global_step crosses a decay_steps boundary.
    learning_rate = tf.train.exponential_decay(starter_learning_rate,
                                               global_step,
                                               decay_steps,
                                               0.92,
                                               staircase=True,
                                               name='lr_decay')

    # Passing global_step to minimize() will increment it at each step.
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                       beta1=FLAGS.beta1,
                                       beta2=FLAGS.beta2,
                                       epsilon=FLAGS.epsilon)
    return optimizer
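
For context, here is a rough TF1-style sketch (toy loss and stand-in numbers, not my actual flags) of how the decayed learning rate, the global_step variable, and minimize() have to be wired together for the decay to advance; if global_step is never passed to minimize(), the rate just stays at its starting value. With staircase=True, exponential_decay evaluates to starter_learning_rate * 0.92 ** (global_step // decay_steps).

import tensorflow as tf

global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.001           # stand-in for FLAGS.learning_rate
decay_steps = 1000                      # stand-in for the heuristic above

learning_rate = tf.train.exponential_decay(starter_learning_rate,
                                           global_step,
                                           decay_steps,
                                           0.92,
                                           staircase=True)

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

x = tf.Variable(5.0)
loss = tf.square(x)                     # toy loss just to make the sketch runnable

# Passing global_step here is what increments it and advances the decay schedule.
train_op = optimizer.minimize(loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        _, lr = sess.run([train_op, learning_rate])
        print(lr)                       # stays at 0.001 until step 1000, then drops to 0.00092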

@lissyx, @reuben, do the two of you have any thoughts on doing this? The original paper mentions:

We use momentum of 0.99 and anneal the learning rate by a constant factor, chosen to yield the fastest convergence, after each epoch through the data.
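
If I read that right, annealing "by a constant factor after each epoch" is just repeated multiplication by a fixed factor, which is what the staircase exponential decay above reproduces when decay_steps equals the number of steps in one epoch. A tiny worked example with made-up numbers (the paper doesn't give the factor):

initial_lr = 0.001
anneal_factor = 0.95    # the "constant factor"; this value is a guess, not from the paper

lr = initial_lr
for epoch in range(5):
    lr *= anneal_factor             # applied once after each pass through the data
    print(epoch, lr)                # 0.00095, 0.0009025, 0.000857375, ...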

Was there a reason you wouldn’t implement this? Did you already experiment with this?

Adam already has a similar mechanism, so we didn’t bother hand-tuning that. Please share your results!
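
By "similar mechanism" I'm referring to Adam's per-parameter adaptive step sizes: each weight's update is scaled by running estimates of its gradient moments, so the raw learning rate is already being adapted per parameter. A minimal sketch of the textbook Adam update for one scalar weight (not our actual code):

import math

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m = v = 0.0        # running first- and second-moment estimates
w = 0.5            # a single toy weight

for t in range(1, 4):
    g = 2 * w                                    # gradient of a toy loss w**2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)   # step size adapts via sqrt(v_hat)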

This didn’t appear to speed up training. Applying my own step-wise fine-tuning was easier to get working.