Distributed training setup of the DeepSpeech code

Hi,

I am currently trying to deploy the DeepSpeech code on Google Cloud ML and am facing a few issues. My problem mainly comes from the Coordinator class. The code gets deployed on all the nodes, but only the master starts training, while the workers throw thread-related errors and shut down.
Is this the right forum to get this answered? I could not find a solution in the GitHub issues.

Hi @vishalpagidipally, were you able to do distributed training with DeepSpeech? I am running into the same situation.
If you have figured it out, could you share your training script?

@Tilman_Kamp Could you suggest anything here?

Simple things first: Only worker 0 (the master worker) is saving checkpoint files.
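
To illustrate the pattern (a minimal TF 1.x sketch, not the actual DeepSpeech code; the function and argument names here are just placeholders): only the chief gets a checkpoint_dir, so only it ever writes checkpoint files.

import tensorflow as tf

def make_training_session(server_target, is_chief, checkpoint_dir, hooks=None):
    # Only the chief (worker 0) gets a checkpoint_dir and therefore writes
    # checkpoint files; all other workers train without saving anything.
    return tf.train.MonitoredTrainingSession(
        master=server_target,
        is_chief=is_chief,
        checkpoint_dir=checkpoint_dir if is_chief else None,
        hooks=hooks)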

Regarding the threading issues on non-master workers: I once had similar problems with the current distributed training code in master.
@jageshmaharjan: Could you provide some context information and a stack trace? Please open a new issue on GitHub and assign me (tilmankamp).


Ok, sure. Thank you.

How can I get rid of the Coordinator and turn this into normal parameter-server (ps) mode distributed code?
Could you share your code?
In our cluster environment, HTTP communication is restricted, so the current distributed implementation is not deployable for us.

Could you share some of your experience?
Thank you!
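
To clarify what I mean by "normal ps mode": something like the plain TF 1.x parameter-server pattern sketched below (host names, flag handling and build_model() are placeholders, not DeepSpeech code). All coordination there goes over gRPC between the ps and worker tasks, so no separate HTTP channel is needed.

import tensorflow as tf

tf.app.flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "index of this task within its job")
FLAGS = tf.app.flags.FLAGS


def build_model():
    # Stand-in for the real graph construction: a trivial variable, loss and train op.
    global_step = tf.train.get_or_create_global_step()
    w = tf.get_variable("w", initializer=0.0)
    loss = tf.square(w - 1.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step)
    return loss, train_op


def main(_):
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222"],                          # placeholder hosts
        "worker": ["worker0:2222", "worker1:2222"],
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()                                # ps tasks only serve variables
    else:
        # Variables are placed on the ps tasks, ops on this worker.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            loss, train_op = build_model()
        with tf.train.MonitoredTrainingSession(
                master=server.target,
                is_chief=(FLAGS.task_index == 0),
                hooks=[tf.train.StopAtStepHook(last_step=1000)]) as session:
            while not session.should_stop():
                session.run(train_op)


if __name__ == "__main__":
    tf.app.run()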

Hello! We have done some work to get rid of the Coordinator code and have even integrated Horovod into the current distributed training code. @Tilman_Kamp, could you let me know the best way to make this available, through a PR or otherwise? Thanks!

@deepthi.karkada, I was also looking forward to using Horovod for distributed training. Because of the Coordinator code, it did not let me set -np to more than 1; I kept getting an error saying "Address already in use". However, when I ran mpirun with -np 1, I observed that all my GPUs were being used and a corresponding number of processes was being created.

Would you care to share your PR with the changes to the Coordinator code?

The one I did was just a small change, without modifying the Coordinator code:

import tensorflow as tf
import horovod.tensorflow as hvd

# initialize_globals, create_optimizer, train, export, do_single_file_inference,
# FLAGS and COORD are the existing ones from DeepSpeech.py.

def main(_):
    hvd.init()

    # Session config: pin each process to one local GPU and allow memory growth.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    initialize_globals()

    # Create the optimizer and wrap it in Horovod's DistributedOptimizer so
    # gradients are averaged across all processes.
    opt = create_optimizer(hvd.size())
    opt = hvd.DistributedOptimizer(opt)

    # Broadcast the initial variables from rank 0 to every other process.
    hooks = [
        hvd.BroadcastGlobalVariablesHook(0)
    ]

    print("hvd_Size: " + str(hvd.size()))
    print("hvd_Rank: " + str(hvd.rank()))

    # Only rank 0 keeps a checkpoint directory, so only it saves checkpoints.
    FLAGS.checkpoint_dir = FLAGS.checkpoint_dir if hvd.rank() == 0 else None

    train(opt, hooks)

    # Are we the main process? With Horovod, rank 0 plays the role of the chief.
    is_chief = hvd.rank() == 0
    if is_chief:
        # Doing solo/post-processing work just on the main process...
        # Exporting the model
        if FLAGS.export_dir:
            export()

    if FLAGS.one_shot_infer:
        do_single_file_inference(FLAGS.one_shot_infer)

    # Stopping the coordinator
    COORD.stop()

Have you looked at this one: