I’ve been trying to use multiple GPUs for training, but the run hangs at initializing the process group (the process stays responsive, but waits there indefinitely with no output or error). Any insights on how to fix this?
I don’t have a setup to reproduce your problem, so it’s hard to guess, but it could be something about your machine’s configuration, your PyTorch version, the dataloader, etc.
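One thing worth ruling out first (a sketch, assuming you’re using the default `env://` rendezvous): a hang at `init_process_group` is very often just the ranks failing to reach each other on `MASTER_ADDR:MASTER_PORT`, e.g. because of a firewall, a wrong interface, or rank 0 not having started yet. This small pure-Python preflight check (the `check_rendezvous` helper is hypothetical, not a PyTorch API) tests whether the rendezvous endpoint is even reachable before you involve `torch.distributed` at all:

```python
import os
import socket

def check_rendezvous(addr: str, port: int, timeout: float = 3.0) -> bool:
    """Try a plain TCP connection to the rendezvous endpoint that
    init_process_group (env:// init) would use. If this never
    connects, init_process_group will hang the same way."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False

# Rank 0 must be listening on MASTER_ADDR:MASTER_PORT for the
# other ranks to join; 29500 is a commonly used default port.
addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
port = int(os.environ.get("MASTER_PORT", "29500"))
print(f"rendezvous reachable: {check_rendezvous(addr, port)}")
```

If the endpoint is reachable and it still hangs, running the job with `NCCL_DEBUG=INFO` set in the environment usually prints enough to see which rank or interface is stuck.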