Thank you very much for your reply!
Today I tried train.py for distributed training. Here is what I ended up with: I waited for about half an hour while a single python3 process loaded one CPU core to 100%, multiple worker processes sat at 2-3% CPU, and nvidia-smi showed nothing at all for about 10 minutes, after which one process started allocating memory on the first of my 4 GPUs. After another 20 minutes I decided to give distributed.py a try instead: within 10-15 seconds the epoch started and all 4 GPUs were equally utilized. Did I do something wrong? train.py detected all 4 GPUs, and judging by its code it should work (it actually imports some important functions from distributed.py). I used the same config in both cases.
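In case it helps to pin down the difference, this is roughly what I expect distributed.py is doing under the hood: a minimal sketch of the standard PyTorch launcher pattern, where one worker process is spawned per GPU and each binds to its own device. I haven't verified this against the repo's actual code; `run_worker`, the master address, and the port are just placeholders of mine:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run_worker(rank: int, world_size: int) -> None:
    # Each spawned process joins the process group under its own rank
    # and pins itself to exactly one GPU.
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # The real training loop would run here; this just shows that every
    # rank is alive on its own device, which matches what I saw with
    # distributed.py (all 4 GPUs busy almost immediately).
    print(f"rank {rank}/{world_size} running on cuda:{rank}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 4 in my case
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```

If train.py only ever enters this path through distributed.py, that might explain why running it directly kept everything in one process on one GPU, but I'd appreciate confirmation.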
Thank you!