Hardware for training

Hi *,

First of all, I want to thank you for doing this project. It's just great.

I wonder whether you could say a little more about the hardware you used to train the model. For example, which mainboard did you use for the 8 Titan XP cards? Did you run into any heat or power issues? Could it be beneficial to use GPUs with stronger half-precision performance, or is single precision mandatory? And in general: what kind of hardware would you buy if you had to build a system like this today? (Yes, I'm thinking about building a machine like that, which is why I'm asking :slight_smile: )

best regards,
Jochen

Right now we train on the following setup:

  • Headnode with 100 TB disk
  • 4 Worker Nodes
  • Each Worker Node (SuperMicro SuperServer 4028GR-TR/TRT)
    • 8 Titan X Pascal GPUs
    • 10Gb Networking
    • 128 GB RAM

We’ve had no heat issues. However, each machine draws a lot of power (up to about 4 kW), so you have to make sure it is properly supplied. We haven’t tried half precision, but I’d guess it would work fine.
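For anyone weighing the half-precision question, here's a quick back-of-the-envelope check of the memory side of the trade-off (a generic numpy sketch, not tied to this project's training code; the parameter count is illustrative, not the actual model size):

```python
# Hypothetical illustration: fp16 halves the bytes needed per value
# compared to fp32, which is one reason half precision is attractive
# on memory-constrained GPUs.
import numpy as np

n_params = 1_000_000  # illustrative parameter count

fp32 = np.zeros(n_params, dtype=np.float32)
fp16 = np.zeros(n_params, dtype=np.float16)

print(f"fp32: {fp32.nbytes / 1e6:.1f} MB")  # prints "fp32: 4.0 MB"
print(f"fp16: {fp16.nbytes / 1e6:.1f} MB")  # prints "fp16: 2.0 MB"
```

The usual caveat is that the memory and throughput win only helps if the numerics stay stable, which is why frameworks typically keep certain accumulations in single precision even when weights and activations are fp16.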

As to purchasing something today, I'd have to do a bit more research before giving an answer I'd feel confident of, as I haven't followed the very latest round of cards released by NVIDIA.
