Hangs on dist.init_process_group in distribute.py

alchemi5t · September 5, 2019, 7:16am

I’ve been trying to use multiple GPUs for training, but training hangs at process group initialization. The process stays responsive but sits there indefinitely without any output or error. Any insights on how to fix this?

I tried PyTorch 0.4.1 and 1+.

  "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:23456"
    },

erogol · September 5, 2019, 10:50am

I don’t have a setup that reproduces your problem, so it’s hard to guess. It might be something in your machine’s setup, the PyTorch version, the dataloader, etc.
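
One machine-setup check that needs nothing beyond the Python standard library: with a tcp:// init method, init_process_group blocks until every process in the group has connected to the rank-0 address, so a silent hang often means some rank never reached the rendezvous. The sketch below is only a diagnostic aid, not part of distribute.py. It parses the "distributed" block from the first post (wrapped in braces here so it is valid JSON) and reports whether anything is listening on that host and port. The helper names rendezvous_address and is_listening are made up for this example.

```python
import json
import socket
from urllib.parse import urlparse

# Same "distributed" block as in the post, wrapped in braces so it parses as JSON.
CONFIG_TEXT = r'''
{
    "distributed": {
        "backend": "nccl",
        "url": "tcp:\/\/localhost:23456"
    }
}
'''


def rendezvous_address(config_text):
    """Return (host, port) taken from the config's distributed.url entry."""
    url = json.loads(config_text)["distributed"]["url"]
    parsed = urlparse(url)
    return parsed.hostname, parsed.port


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    host, port = rendezvous_address(CONFIG_TEXT)
    state = "listening" if is_listening(host, port) else "not listening"
    print(f"{host}:{port} is {state}")
```

Run it from a second shell while training is stuck. If nothing is listening, rank 0 probably never opened the rendezvous. If the port is open, rank 0 is likely up and waiting for the remaining ranks to join. Either result only narrows things down and does not identify the cause on its own.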

alchemi5t · September 19, 2019, 4:54pm

Fixed it by installing apex. Not sure why I needed it, though.