Distributed training: optimal parameters

Thanks a lot for your reply!

Today I tried to use train.py for distributed training. What I ended up with: I waited for about half an hour while a single python3 process loaded the CPU up to 100%, multiple workers loaded the CPU to only 2-3%, and in nvidia-smi I saw nothing at all for about 10 minutes, then one process started to use the memory of the first of the 4 GPUs. So 20 minutes later I decided to give distributed.py a try. After about 10-15 seconds the epoch started and all 4 GPUs were equally involved. Did I do something wrong? train.py saw that I have 4 GPUs, and according to its code it seems like it should work (it actually imports some important functions from distributed.py). I used the same config in both cases.
Thank you!

You aren’t doing anything wrong. That is probably the intended usage. I use apex to launch training processes (using train.py, otherwise it hangs like you said). You should be good.

Thank you a lot, will try it now! Actually, with 4x Tesla V100 training seems to run at the same speed as with a single RTX 2080 Ti. So my last hope is to make train.py work!

I don’t think that’s your problem. 4 V100s give you a lot of memory, which means you can try out larger batch sizes, which should speed up your training. I am not sure how a large (>64) batch size is going to affect your training, though.

If your effective batch size on the single GPU is the same as that on 4 GPUs, you shouldn’t be seeing much gain in time anyway.
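
To put that in numbers (the dataset size here is made up): with the effective batch size held fixed, the number of optimizer steps per epoch does not change, you only split each step across the GPUs and add communication overhead on top.

    # Hypothetical numbers, just to illustrate the effective-batch-size point.
    samples_per_epoch = 13100          # assumed dataset size
    effective_batch_size = 32          # same total batch in both setups

    steps_single_gpu = samples_per_epoch // effective_batch_size
    steps_four_gpus = samples_per_epoch // effective_batch_size   # unchanged

    print(steps_single_gpu, steps_four_gpus)   # 409 409 -> same step count per epoch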

I could be wrong; I’d wait for second opinions. @carlfm01 Any insights on this?

Thank you for the reply!

I’ve tried to install apex, but I guess due to a CUDA version mismatch (I have 10.1, while apex, or rather PyTorch, requires 10.0), even though I commented out that check and it says apex is installed, train.py still doesn’t work.
Anyway, if it isn’t supposed to improve the speed, let it go. But here’s the question: why did Erogol use two GPUs with a total batch size of 32? What’s the point then of using two GPUs, if it is not supposed to improve the speed?

For the VRAM, I suppose; like I said, I am not sure if my assumptions were right. Let’s ask him, @erogol.

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO python -m apex.parallel.multiproc /tts/TTS/train.py --config_path /config_moz_runningtests.json --restore_path ./results/r2_test-September-30-2019_10+27AM-/checkpoint_237000.pth.tar

You’re gonna need that (check the bold parts).
Also, absolute paths are necessary when you use this.

Hello, as far as I can recall I tried using multi-GPU while I was trying to find the correct hparams, so I can’t really tell whether the improvement came from multi-GPU or from the hparams change. I was also trying different PyTorch versions to fix my issue of slow training, so any suggestion is biased by changes from run to run.

Thank you a lot for that command; I just had to add a ‘--world_size’ argument to train.py for some reason, but it worked! The speed remained the same, though. It’s indeed slower than on the single GPU… two times slower.
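
For reference, the change was roughly this, assuming train.py parses its arguments with argparse (the exact flag name that apex.parallel.multiproc passes down may differ):

    # Hypothetical sketch of the extra argument added to train.py.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--config_path', type=str, required=True)
    parser.add_argument('--restore_path', type=str, default='')
    parser.add_argument('--world_size', type=int, default=1,
                        help='number of processes launched by apex.parallel.multiproc')
    args, _ = parser.parse_known_args()   # ignore any extra flags the launcher appends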

This is necessary too. I missed that. But I believe you were on the right path with distribute.py anyway.

Hello!

Did you manage to get a reasonable speed improvement after all? Is it worth trying multiple GPUs at all?

I use multiple GPUs to be able to train with an effective batch_size of 32 (my exemplars are quite long). I believe that if you use a larger batch_size you will get a speed-up, but I don’t know how that’s going to affect your training. Like I said, erogol could clarify this authoritatively; let’s wait for his reply.

@alchemi5t does apex work on python3?

Yes it does.

Could you please clear up the question about the speed-up? Are my assumptions right?

But if I pip install apex and call the command you have above, it raises python3 incompatibility issues. Maybe the way I installed it is wrong.

Yes, it is mostly CPU bottlenecking. And if your sequences get longer, it takes even more time. I don’t count the first epoch, since it computes phonemes and caches them.
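
If it is the data pipeline that is the bottleneck, the usual knobs are more DataLoader workers and pinned memory. A minimal standalone sketch (not TTS’s actual loader; the dataset below is a dummy stand-in):

    import torch
    from torch.utils.data import DataLoader, Dataset

    class DummyTTSDataset(Dataset):
        """Placeholder returning variable-length 'phoneme' sequences."""
        def __len__(self):
            return 64

        def __getitem__(self, idx):
            length = 50 + (idx % 10) * 10              # fake variable lengths
            return torch.randint(0, 100, (length,))

    def pad_collate(batch):
        # pad every sample to the longest sequence in this batch
        return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True)

    if __name__ == "__main__":
        loader = DataLoader(
            DummyTTSDataset(),
            batch_size=8,
            num_workers=4,        # more worker processes for CPU-side preprocessing
            pin_memory=True,      # faster host-to-GPU copies
            collate_fn=pad_collate,
        )
        for batch in loader:
            pass                  # the training step would go here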

If there is something technical, here is a good pointer: https://github.com/espnet/espnet#error-due-to-acs-multiple-gpus

Could you please post the log?

I will, with my next run.

Could you also say whether it performs better than TTS distributed?

Give me a few days (I do not have access to my machine right now), and I will benchmark them and give you details.

My 2 cents: I do not believe it’s faster by much. It should be equally time-consuming or slightly better.

Thx. That would be great. I also plan to add apex lower-precision training to TTS soon.
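
For reference, the apex mixed-precision (amp) usage looks roughly like this; the model and optimizer below are placeholders, not the TTS ones:

    import torch
    from apex import amp

    model = torch.nn.Linear(80, 80).cuda()                       # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # O1 = conservative mixed precision; O2 is more aggressive
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    x = torch.randn(32, 80).cuda()
    loss = model(x).pow(2).mean()                                # dummy loss

    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:         # loss scaling for fp16
        scaled_loss.backward()
    optimizer.step()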

BTW, one of the bottlenecks is that all the GPUs are limited by the longest sequence across all their batches. As you use more GPUs you have a larger global batch size, which increases the gap between GPUs, so the GPUs with shorter sequences stall more.
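
A toy illustration of that stalling effect (the sequence lengths are made up): every GPU has to wait at the gradient sync for whichever GPU drew the longest sequence.

    # Made-up sequence lengths per GPU for one training step.
    batch_lengths_per_gpu = [
        [120, 300, 180, 240],   # GPU 0
        [100, 150, 900, 200],   # GPU 1  <- one very long sample paces everyone
        [130, 160, 170, 210],   # GPU 2
        [140, 220, 190, 230],   # GPU 3
    ]

    # The step roughly takes as long as the longest sequence in the global batch.
    step_time = max(max(lens) for lens in batch_lengths_per_gpu)
    for gpu, lens in enumerate(batch_lengths_per_gpu):
        busy = max(lens)
        print(f"GPU {gpu}: busy ~{busy}, stalled ~{step_time - busy} at the sync point")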

So, after all, the only point of using multiple GPUs is to reach the recommended batch size when it is not feasible on a single GPU?
But then what is all that info on the internet about a 1.8x speed-up with 2x RTX 2080 Ti vs 1x RTX 2080 Ti? E.g. here: https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
My only guess is that if for a single GPU we use a batch size of, say, 64, which is the maximum possible for that GPU, and with multiple GPUs we also use 64 but per GPU, so in total we get a batch size of 128 or 256 or so, that leads to a speed improvement (as the batch size is bigger). Because if we use the same TOTAL batch size in both cases, the multi-GPU setup will lose.
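
Rough arithmetic for what I mean (the dataset size is hypothetical): keeping 64 per GPU instead of 64 in total gives 4x fewer optimizer steps per epoch, which is where the speed-up would have to come from, assuming the larger effective batch still trains well.

    # Hypothetical numbers to compare the two multi-GPU setups.
    samples_per_epoch = 13100
    per_gpu_batch, num_gpus = 64, 4

    steps_same_total = samples_per_epoch // per_gpu_batch                 # total batch = 64
    steps_per_gpu_kept = samples_per_epoch // (per_gpu_batch * num_gpus)  # total batch = 256

    print(steps_same_total, steps_per_gpu_kept)   # 204 vs 51 steps per epoch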