The release notes mention that v0.8.2 was trained with 8 GPUs and a training batch size of 128.
So my question is: what was the actual batch size?
- Was it 128 per GPU, making an effective batch size of 128 × 8 = 1024?
- Or was it 128 across all 8 GPUs, i.e. a per-GPU batch size of 128 / 8 = 16?
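
For context on why the two readings differ, here is a minimal sketch assuming a standard PyTorch DDP setup (the actual v0.8.2 training script may differ); in DDP, the `batch_size` passed to each process's `DataLoader` is per GPU:

```python
world_size = 8  # number of GPUs / DDP processes

# Reading (1): 128 is the per-GPU batch size.
per_gpu = 128
effective = per_gpu * world_size
print(f"per GPU: {per_gpu}, effective: {effective}")  # per GPU: 128, effective: 1024

# Reading (2): 128 is the global (effective) batch size.
effective = 128
per_gpu = effective // world_size
print(f"per GPU: {per_gpu}, effective: {effective}")  # per GPU: 16, effective: 128
```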