Distributed training: optimal parameters

BTW, one of the bottlenecks is that every GPU is limited by the longest sequence across the per-GPU batches: as you add GPUs the global batch size grows, which increases the gap between the slowest and the fastest rank, so the GPUs holding shorter sequences stall for longer.
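
Just to illustrate that mechanism, here is a toy simulation (my own sketch, with random sequence lengths and a rough cost model, not measurements from any real model): every step all ranks wait at the gradient sync for the slowest one, and the slowest one is whichever rank drew the longest sequences, so the expected stall grows with the number of ranks.

```python
# Toy simulation of the straggler effect described above. Sequence lengths are
# random placeholders; the cost model (a padded micro-batch costs roughly
# batch_size * its longest sequence) is an assumption, not a measurement.
import random

def stall_fraction(num_gpus, per_gpu_batch=16, steps=2000, max_len=512):
    """Average fraction of a step a GPU spends waiting for the slowest rank."""
    total_cost = 0.0
    total_wall = 0.0
    for _ in range(steps):
        per_gpu_cost = []
        for _ in range(num_gpus):
            lengths = [random.randint(1, max_len) for _ in range(per_gpu_batch)]
            per_gpu_cost.append(per_gpu_batch * max(lengths))
        total_cost += sum(per_gpu_cost)
        # gradient sync means the step lasts as long as the slowest GPU
        total_wall += num_gpus * max(per_gpu_cost)
    return 1.0 - total_cost / total_wall

for n in (1, 2, 4, 8):
    print(f"{n} GPU(s): ~{stall_fraction(n):.1%} idle time from stragglers")
```

Sorting or bucketing sequences by length before batching reduces this, but some stall remains as long as the ranks synchronize every step.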

So, after all, is the only point of using multiple GPUs to reach the recommended batch size when it isn't feasible on a single GPU?
But then what is all the info on the internet about a 1.8x speedup with 2x RTX 2080 Ti vs a single RTX 2080 Ti? e.g. here: https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
My only suggestion is: if we use a batch size of, say, 64 for a single GPU (the maximum that fits on one GPU), then with multiple GPUs we also use 64, but per GPU, so the total batch size becomes 128 or 256 and so on; that is what gives the speed improvement (because the batch size is bigger). If we keep the TOTAL batch size the same in both cases, the multi-GPU setup will lose. See the DDP sketch below.
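
For reference, a minimal sketch of that "fixed per-GPU batch size" setup with PyTorch DDP. The toy dataset, the tiny linear model, and the per-GPU batch size of 64 are placeholders, and it assumes a single node launched with torchrun, but it shows where the per-GPU vs. global batch size distinction lives:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launch with e.g.: torchrun --nproc_per_node=2 train.py
    dist.init_process_group("nccl")
    rank = dist.get_rank()          # single-node assumption: rank == local GPU index
    torch.cuda.set_device(rank)

    # Toy data just to keep the sketch self-contained.
    dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 2, (4096,)))
    sampler = DistributedSampler(dataset)   # each rank gets a different shard

    # batch_size is PER PROCESS/GPU: with 64 here, the effective global batch
    # is 64 * world_size (128 on 2 GPUs, 256 on 4 GPUs, and so on).
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = DDP(nn.Linear(32, 2).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)            # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(rank), y.cuda(rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward() # DDP all-reduces gradients here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```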

That's what I've been saying.

It's just that this is not a very easy network to train, and there aren't any documented experiments showing whether a larger batch size hurts the training or not. You could experiment with it and let us know (like I did for batch size 64, which wasn't documented before). Good luck!
