For the first question, your CPU is probably bottlenecking your GPU, and what you wrote about a higher batch size should also be correct. I've trained models with a batch size of 64 and haven't noticed anything wonky with it.
You should use train.py. The batch_size in the config is per-GPU, not the effective batch size, so if you want an effective batch size of 32 and you have 2 GPUs, you should set batch_size to 16 (16 × 2 = 32).
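To make the arithmetic concrete, here's a minimal sketch of the per-GPU calculation, assuming plain data-parallel training where each GPU processes its own mini-batch (the function name here is illustrative, not from the repo's code):

```python
# In data-parallel training, effective batch size = per-GPU batch * num GPUs,
# so the config value should be the effective size divided by the GPU count.
def per_gpu_batch_size(effective_batch_size: int, num_gpus: int) -> int:
    assert effective_batch_size % num_gpus == 0, "batch size must divide evenly across GPUs"
    return effective_batch_size // num_gpus

print(per_gpu_batch_size(32, 2))  # -> 16, the value to put in the config for 2 GPUs
```

If gradient accumulation is also in play, the effective batch size gets multiplied by the number of accumulation steps as well.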