For the first question, your CPU is probably bottlenecking your GPU, and what you wrote about a higher batch size should also be correct. I've trained models with a batch size of 64 and haven't noticed anything wonky with it.
You should use train.py. The batch_size in the config is per-GPU, not the effective batch size, so if you want an effective batch size of 32 and you have 2 GPUs, you should set batch_size to 16 (16 × 2 = 32).
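To make the arithmetic concrete, here's a minimal sketch of the per-GPU calculation, assuming plain data-parallel training where each GPU processes its own mini-batch (the function name here is illustrative, not from the repo's code):

```python
# In data-parallel training, effective batch size = per-GPU batch * num GPUs,
# so the config value should be the effective size divided by the GPU count.
def per_gpu_batch_size(effective_batch_size: int, num_gpus: int) -> int:
    assert effective_batch_size % num_gpus == 0, "batch size must divide evenly across GPUs"
    return effective_batch_size // num_gpus

print(per_gpu_batch_size(32, 2))  # -> 16, the value to put in the config for 2 GPUs
```

If gradient accumulation is also in play, the effective batch size gets multiplied by the number of accumulation steps as well.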