Distributed training: optimal parameters

Hello everyone!

I have access to a server with a V100 GPU, so I tried to train a model there with a batch size of 32 for training and 16 for evaluation. Unfortunately, the GPU is not being used at 100%. On average the load is 30-40%; occasionally it goes up to 85-90% for a short time, then drops to 14%. My question is: if I set the batch size to, say, 48 (with it set to 32 it uses 11 GB of GPU RAM), will the GPU load go up (faster training) WHILE the final model performance is not damaged?
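One thing worth sanity-checking before raising the batch size is whether it will still fit in memory. A rough sketch, assuming GPU memory grows roughly linearly with batch size (an approximation only: activations scale with the batch, but model weights and CUDA context do not):

```python
def estimate_vram_gb(measured_gb, measured_batch, target_batch):
    """Linear extrapolation of GPU memory use to a new batch size.

    Rough approximation: activation memory scales with batch size,
    but weights and CUDA context are fixed, so this over-estimates a bit.
    """
    return measured_gb * target_batch / measured_batch

# 11 GB at batch size 32 extrapolates to about 16.5 GB at batch size 48,
# which would be tight (or too much) on a 16 GB V100.
print(estimate_vram_gb(11, 32, 48))
```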

After testing on a single V100 I want to do distributed training. Should I just use distributed.py instead of train.py and provide the same config as for single-GPU training? Should I keep the same batch size?

Thank you a lot!

For the first question: your CPU is probably bottlenecking your GPU. What you wrote about a higher batch size should also be correct; I’ve trained models with a batch size of 64 and haven’t noticed anything wonky.

You should use train.py. The batch_size in the config is the per-GPU batch size, not the effective batch size, so if you want an effective batch size of 32 and you have 2 GPUs, you should set batch_size to 16 (16 * 2 = 32).
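In other words, divide the effective batch size you want by the number of GPUs to get the config value. A minimal sketch (the function name is mine, not from TTS):

```python
def per_gpu_batch_size(effective_batch_size, num_gpus):
    """Split an effective batch size evenly across GPUs."""
    if effective_batch_size % num_gpus != 0:
        raise ValueError("effective batch size must be divisible by GPU count")
    return effective_batch_size // num_gpus

# Effective batch of 32 on 2 GPUs -> set batch_size to 16 in the config.
print(per_gpu_batch_size(32, 2))  # 16
```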


Thank you a lot for your reply!

Today I tried to use train.py for distributed training. What I ended up with: I waited for about half an hour while a single python3 process loaded the CPU up to 100%, multiple workers loaded the CPU to 2-3%, and in nvidia-smi I saw nothing at first for about 10 minutes; then one process started to use the RAM of the first of the 4 GPUs. So 20 minutes later I decided to give distributed.py a try. After about 10-15 seconds the epoch started and all 4 GPUs were equally involved. Did I do something wrong? train.py saw that I have 4 GPUs, and according to its code it seems like it should work (it actually imports some important functions from distributed.py). I used the same config in both cases.
Thank you!

You aren’t doing anything wrong; that is probably the intended usage. I use apex to launch the training processes (using train.py, otherwise it hangs like you said). You should be good.

Thank you a lot, will try now! Actually, with 4x Tesla V100, training seems to be the same speed as with a single RTX 2080 Ti. So my last hope is to make train.py work!

I don’t think that’s your problem. 4 V100s afford you a lot of memory, which means you can try larger batch sizes, which should speed up your training. I am not sure how a large (>64) batch size is going to affect your training, though.

If your effective batch size on the single GPU is the same as on 4 GPUs, you shouldn’t be seeing much gain in time anyway.
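The reasoning: the number of optimizer steps per epoch depends only on the dataset size and the effective batch size, so at a fixed effective batch size adding GPUs doesn’t reduce the step count. A toy illustration (the dataset size is just an example, roughly LJSpeech-sized):

```python
import math

def steps_per_epoch(num_samples, effective_batch_size):
    """Optimizer steps per epoch at a given effective batch size."""
    return math.ceil(num_samples / effective_batch_size)

# ~13,100 clips with an effective batch of 32: the step count is the
# same whether that batch sits on 1 GPU or is split 8-per-GPU over 4.
print(steps_per_epoch(13100, 32))
```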

I could be wrong; I’d wait for second opinions. @carlfm01 Any insights on this?

Thank you for the reply!

I’ve tried to install apex, but I guess due to a CUDA mismatch (I have 10.1, while apex (or PyTorch) requires 10.0), train.py still doesn’t work, even though I commented out this check and it says apex is installed.
Anyway, if it isn’t supposed to improve the speed, let it go. But here’s the question: why did Erogol use two GPUs with a total batch size of 32? What’s the point of using two GPUs then, if it is not supposed to improve the speed?

For the VRAM, I suppose; like I said, I am not sure if my assumptions are right. Let’s ask him: @erogol.

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO python -m apex.parallel.multiproc /tts/TTS/train.py --config_path /config_moz_runningtests.json --restore_path ./results/r2_test-September-30-2019_10+27AM-/checkpoint_237000.pth.tar

You’re gonna need that (check the highlighted parts).
Also, absolute paths are necessary when you use this.

Hello. As far as I can recall, I tried using multi-GPU while I was still trying to find the correct hparams, so I can’t really tell whether an improvement came from multi-GPU or from the hparams change. I was also trying different PyTorch versions to fix my issue of slow training, so any suggestion is biased by changes from run to run.

Thank you a lot for that command. I just had to add a ‘–world_size’ argument to train.py for some reason, but it worked! The speed remained the same, though. It’s indeed slower than on the single GPU… two times slower.
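For reference, what I mean by adding the argument: something like this in train.py’s argument parser, assuming the launcher passes --world_size through to each worker process (the other flag names below are just the ones from the command above):

```python
import argparse

parser = argparse.ArgumentParser()
# Existing flags, as used in the launch command above:
parser.add_argument("--config_path", type=str)
parser.add_argument("--restore_path", type=str, default="")
# The extra flag I had to accept so the launcher's arguments don't crash parsing:
parser.add_argument("--world_size", type=int, default=1)

# Simulated command line, just to show the flag being consumed:
args = parser.parse_args(["--config_path", "config.json", "--world_size", "4"])
print(args.world_size)  # 4
```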

This is necessary too; I missed that. But I believe you were on the right path with distributed.py anyway.



Did you succeed in getting a reasonable speed improvement after all? Is it worth trying multiple GPUs at all?

I use multiple GPUs to be able to train with an effective batch_size of 32 (my exemplars are quite long). I believe that if you use a larger batch_size you will achieve a speedup, but I don’t know how that’s going to affect your training. Like I said, erogol could authoritatively clarify this; let’s wait for his reply.


@alchemi5t does apex work on python3?

Yes it does.

Could you please clarify the question about the speedup? Are my assumptions right?

But if I pip install apex and run the command you posted above, it raises Python 3 incompatibility issues. Maybe the way I installed it is wrong.

Yes, it is mostly CPU bottlenecking. And if your sequences get longer, it takes even more time. I don’t count the first epoch, since it computes phonemes and caches them.

If something technical comes up, this is a good pointer: https://github.com/espnet/espnet#error-due-to-acs-multiple-gpus


Could you please post the log?

I’ll post it with my next run.

Could you also say whether it performs better than TTS’s own distributed training?

Give me a few days (I do not have access to my machine right now), and I will benchmark them and give you details.

My 2 cents: I do not believe it’s faster by much. It should be equally time-consuming or slightly better.

Thanks, that would be great. I also plan to add apex lower-precision training to TTS soon.
