@deepthi.karkada, I was also looking forward to using Horovod for distributed training. Because of the Coordinator code it didn't allow me to set `-np` to more than 1; I kept getting an error saying "Address already in use". However, when I ran with `-np 1` under `mpirun`, I observed that all my GPUs were being used and the corresponding number of processes was being created.
Care to share your PR with your changes? The one I did was just a small change, without modifying the Coordinator code:
```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Pin each process to a single GPU and let GPU memory grow on demand.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())

# create_optimizer() is my own helper; hvd.size() is passed so it can account
# for the number of workers (e.g. learning-rate scaling).
opt = create_optimizer(hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variable states from rank 0 to the other processes.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

print("hvd_Size: " + str(hvd.size()))
print("hvd_Rank: " + str(hvd.rank()))

# Only the main process (rank 0) writes checkpoints.
FLAGS.checkpoint_dir = FLAGS.checkpoint_dir if hvd.rank() == 0 else None
# Are we the main process?
# Doing solo/post-processing work just on the main process...
# Exporting the model
# Stopping the coordinator
```
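In case it helps, below is a minimal, self-contained sketch of how those pieces fit together with `tf.train.MonitoredTrainingSession`. It is not my exact script: the toy variable/loss, the learning rate, the `./checkpoints` path, the `StopAtStepHook` step count, and the final print are just stand-ins; the Horovod parts (`BroadcastGlobalVariablesHook`, `DistributedOptimizer`, the `visible_device_list` pinning, and the rank-0 checkpoint guard) are the standard pattern.

```python
# Minimal sketch: Horovod + MonitoredTrainingSession (TF 1.x style).
# The toy variable/loss below stand in for the real model.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to its own GPU and allow memory growth.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model: a single trainable variable pulled towards 1.0.
w = tf.Variable(0.0, name="w")
loss = tf.square(w - 1.0)

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients get averaged across all ranks.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    # Make every worker start from rank 0's initial variable values.
    hvd.BroadcastGlobalVariablesHook(0),
    tf.train.StopAtStepHook(last_step=100),
]

# Only rank 0 writes checkpoints so workers don't overwrite each other.
checkpoint_dir = "./checkpoints" if hvd.rank() == 0 else None

with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks,
                                       config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)

# Solo/post-processing work (e.g. exporting the model) only on the main process.
if hvd.rank() == 0:
    print("Training finished; model export would happen here.")
```

With something like this, launching under `mpirun` with `-np` set to the number of GPUs per node gives one process per GPU, each pinned to its own device through `visible_device_list`.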