Hello,
I am waiting to train some modified DeepSpeech code on a GPU and wanted to know whether anyone has already implemented learning rate decay for the Adam optimizer before I begin training. Does anyone have a reason not to do this? My code block is below. Decay would likely shift the best starting point to a much higher learning rate, but it might also help me avoid early stopping triggering inappropriately because I didn't pick a good initial learning rate.
import csv

import tensorflow as tf

def create_optimizer():
    # Count training samples to work out how many steps make up one epoch.
    with open(FLAGS.train_files, "r") as f:
        reader = csv.reader(f, delimiter=",")
        data = list(reader)
    row_count = len(data)
    # Decay once every es_steps epochs' worth of steps -- a heuristic.
    decay_steps = int(row_count / Config.available_devices / FLAGS.train_batch_size) * FLAGS.es_steps
    global_step = tf.Variable(0, trainable=False, name='global_step')
    starter_learning_rate = FLAGS.learning_rate
    learning_rate = tf.train.exponential_decay(starter_learning_rate,
                                               global_step,
                                               decay_steps,
                                               0.92,
                                               staircase=True,
                                               name='lr_decay')
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                       beta1=FLAGS.beta1,
                                       beta2=FLAGS.beta2,
                                       epsilon=FLAGS.epsilon)
    # Return global_step too: it has to be passed to minimize(), which
    # increments it at each step so the decay schedule actually advances.
    return optimizer, global_step
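
For context, here is a minimal sketch of how I expect to wire this into the training loop. The `loss` tensor and `num_steps` count are placeholders standing in for the actual DeepSpeech loss op and training schedule, not names from the real code:

optimizer, global_step = create_optimizer()
# Passing global_step to minimize() is what advances the decay schedule.
train_op = optimizer.minimize(loss, global_step=global_step)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for _ in range(num_steps):
        # global_step increments by 1 on each run of train_op,
        # so learning_rate steps down every decay_steps iterations.
        _, step = session.run([train_op, global_step])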