How to Build a Multi-GPU System for Deep Learning in 2023

My deep learning build — work in progress :).

This story provides a guide on how to build a multi-GPU system for deep learning and hopefully save you some research time and experimentation.

Target

Build a multi-GPU system for training computer vision models and LLMs without breaking the bank! 🏦

Step 1. GPUs

Let’s start with the fun (and expensive 💸💸💸) part!

The H100 beast! Image from NVIDIA.

The main considerations when buying a GPU are:

memory (VRAM)
performance (Tensor cores, clock speed)
slot width
power (TDP)

Memory

For deep learning tasks nowadays we need a loooot of memory. LLMs are huge even to fine-tune and computer vision tasks can get memory-intensive especially with 3D networks. Naturally the most important aspect to look for is the GPU VRAM. For LLMs I recommend at least 24 GB memory and for computer vision tasks I wouldn’t go below 12 GB.

Performance

The second criterion is performance, which can be estimated in FLOPS (floating-point operations per second).

The crucial number in the past was the number of CUDA cores in the circuit. However, with the emergence of deep learning, NVIDIA has introduced specialized tensor cores that can perform many more FMA (Fused Multiply-Add) operations per clock. These are already supported by the main deep learning frameworks and are what you should look for in 2023.

Below you can find a chart of raw performance of GPUs grouped by memory that I compiled after quite some manual work:

Raw performance of GPUs based on the CUDA and tensor cores (TFLOPs).

Note that you have to be extra careful when comparing the performance of different GPUs. Tensor cores of different generations / architectures are not comparable. For instance, the A100 performs 256 FP16 FMA operations / clock while the V100 performs “only” 64. Additionally, older architectures (Turing, Volta) do not support 32-bit tensor operations. What makes the comparison more difficult is that NVIDIA doesn’t always report the FMA operations per clock, not even in the whitepapers, and GPUs of the same architecture can have different values. I kept banging my head against this 😵‍💫. Also note that NVIDIA often advertises tensor FLOPS with sparsity, which is a feature usable only at inference time.
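As a rough sketch of how such raw-performance figures can be derived: peak throughput is approximately the number of cores times the clock speed times the FMA operations per core per clock, times two, since one FMA counts as a multiply plus an add. The FMA rates come from the text above. The tensor core counts and boost clocks are assumptions taken from NVIDIA's public spec sheets, and the function name is only illustrative.

```python
def peak_tflops(cores: int, clock_ghz: float, fma_per_clock: int) -> float:
    """Theoretical peak TFLOPS; each FMA counts as 2 floating-point operations."""
    flops = cores * clock_ghz * 1e9 * fma_per_clock * 2
    return flops / 1e12


# Assumed specs (not from the article): tensor core count and boost clock in GHz.
a100 = peak_tflops(cores=432, clock_ghz=1.41, fma_per_clock=256)
v100 = peak_tflops(cores=640, clock_ghz=1.53, fma_per_clock=64)

print(f"A100 FP16 tensor: {a100:.0f} TFLOPS")
print(f"V100 FP16 tensor: {v100:.0f} TFLOPS")
```

These are theoretical dense peaks. Real training throughput is lower and depends on memory bandwidth and the workload.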

To identify the best GPU with respect to price, I collected eBay prices using the eBay API and computed the relative performance per dollar (USD) for new cards:

Relative performance per USD of GPUs based on the CUDA and tensor cores (TFLOPs / USD). Prices are based on current eBay prices (September 2023).

I did the same for used cards but since the rankings don’t change too much I omit the plot.
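The ranking itself is simple arithmetic. Here is a minimal sketch, assuming you already have a TFLOPS figure and a price for each card. The card names and numbers are placeholders made up for illustration, not the eBay data behind the chart, and the eBay API calls are not shown.

```python
def relative_perf_per_usd(cards: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map card name -> TFLOPS per USD, normalized so the best card scores 1.0."""
    raw = {name: tflops / price for name, (tflops, price) in cards.items()}
    best = max(raw.values())
    return {name: value / best for name, value in raw.items()}


# Placeholder (TFLOPS, price in USD) pairs, purely illustrative.
cards = {
    "card_a": (100.0, 1000.0),
    "card_b": (150.0, 2000.0),
    "card_c": (40.0, 500.0),
}

scores = relative_perf_per_usd(cards)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Grouping the cards by memory size before ranking, as in the chart, would only need one such call per group.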

To select the best GPU for your budget, you can pick one of the top GPUs for the largest memory you can afford. My recommendation would be:

Recommendation of GPUs for different budgets based on current eBay prices (September 2023).

If you want to dive into more technical aspects, I recommend reading Tim Dettmers’ excellent guide on Which GPU(s) to Get for Deep Learning.

Slot width

When building a multi-GPU system, we need to plan how to physically fit the GPUs into a PC case. Since GPUs grow larger and larger, especially the gaming series, this becomes more of an issue. Consumer motherboards have up to 7 PCIe slots and PC cases are built around this setup. A 4090 can easily take up 4 slots depending on manufacturer, so you can see why this becomes an issue. Additionally we should leave at least 1 slot between GPUs that are not blower style or watercooled to avoid overheating. We have the following options:

Watercooling
Watercooled variants will take up to 2 slots but they are more expensive. You can alternatively convert an air-cooled GPU but this will void the warranty. If you don’t get All-in-One (AIO) solutions you will need to build a custom watercooling loop. This is also true if you want to fit multiple watercooled GPUs, since the AIO radiators may not fit in the case. Building your own loop is risky and I wouldn’t personally do it with expensive cards. I would only buy AIO solutions straight from the manufacturers (risk averse 🙈).

Aircooled 2–3 slot cards and PCIe risers
In this scenario you interleave cards on PCIe slots and cards connected with PCIe riser cables. The PCIe riser cards can be placed somewhere inside the PC case or in the open air. In either case you should make sure the GPUs are secured (see also the section about PC cases).
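To sanity-check a layout before buying, the slot arithmetic can be sketched as below. It assumes cards are stacked directly on the motherboard and that a 1-slot gap is left next to any card that is neither blower style nor watercooled, which is my reading of the rule above. The `Card` class and its field names are hypothetical, and the sketch ignores where the PCIe slots actually sit on a given motherboard.

```python
from dataclasses import dataclass


@dataclass
class Card:
    name: str
    slot_width: int   # number of slots the card occupies
    needs_gap: bool   # True for open-air coolers, False for blower / watercooled


def slots_required(cards: list[Card]) -> int:
    """Slots used by cards stacked top to bottom, adding a 1-slot gap next to open-air cards."""
    total = sum(card.slot_width for card in cards)
    for upper, lower in zip(cards, cards[1:]):
        if upper.needs_gap or lower.needs_gap:
            total += 1
    return total


def fits(cards: list[Card], case_slots: int = 7) -> bool:
    """True if the stack fits in a case with the given number of slots."""
    return slots_required(cards) <= case_slots


two_air = [Card("air 3-slot", 3, True), Card("air 3-slot", 3, True)]
three_blower = [Card("blower 2-slot", 2, False) for _ in range(3)]
two_wide = [Card("air 4-slot", 4, True), Card("air 4-slot", 4, True)]

print(slots_required(two_air), fits(two_air))
print(slots_required(three_blower), fits(three_blower))
print(slots_required(two_wide), fits(two_wide))
```

Cards that do not fit this way are the candidates for a PCIe riser.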

Power (TDP)

Modern GPUs are getting more and more power hungry. For instance, a 4090 requires 450 W while an H100 can draw up to 700 W. Apart from the power bill, fitting three or more GPUs becomes an issue. This is especially true in the US, where power sockets can deliver only up to around 1800 W.

A solution to this problem if you are getting close to the max power you can draw from your PSU / power socket is power-limiting. All you need to reduce the max power a GPU can draw is:

sudo nvidia-smi -i <GPU_index> -pl <power_limit>

where:
GPU_index: the index (number) of the card as shown by nvidia-smi
power_limit: the maximum power in W you want the card to draw

Power-limiting by 10-20% has been shown to reduce performance by less than 5% and keeps the cards cooler (experiment by Puget Systems). Power-limiting four 3090s by 20%, for instance, will reduce their consumption to 1120 W, which can easily fit in a 1600 W PSU / 1800 W socket (assuming 400 W for the rest of the components).
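The arithmetic behind that example can be sketched as follows. The 350 W TDP per 3090 is an assumption implied by the 1120 W figure rather than stated in the text, and the function names are illustrative.

```python
def per_card_limit_w(tdp_w: float, reduction: float) -> float:
    """Power limit in W for one card after reducing its TDP by the given fraction."""
    return tdp_w * (1.0 - reduction)


def system_power_w(num_gpus: int, card_limit_w: float, rest_w: float) -> float:
    """Total draw in W: all power-limited GPUs plus the rest of the components."""
    return num_gpus * card_limit_w + rest_w


limit = per_card_limit_w(tdp_w=350, reduction=0.20)   # 350 W TDP assumed for a 3090
total = system_power_w(num_gpus=4, card_limit_w=limit, rest_w=400)

print(f"per-card limit: {limit:.0f} W")
print(f"GPUs: {4 * limit:.0f} W, whole system: {total:.0f} W")
print(f"fits a 1600 W PSU: {total <= 1600}")
```

The per-card value is what you would pass as power_limit to the nvidia-smi command, once per GPU index.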

Step 2. Motherboard and CPU

The next step of the build is to pick a motherboard that allows multiple GPUs. Here the main consideration is the PCIe lanes. We need at minimum PCIe 3.0 slots with x8 lanes for each of the cards (see Tim Dettmers’ post). PCIe 4.0 or 5.0 are rarer and not needed for most deep learning use cases.

Apart from the slot type, the spacing of the slots will determine where you can place the GPUs. Make sure you have checked the spacing and that your GPUs can actually go where you want them to. Note that most motherboards will run some x16 slots in an x8 configuration when you use multiple GPUs. The only reliable place to find this information is the motherboard manual.

The easiest way to not spend hours of research and also future-proof your system is to pick a motherboard that has x16 slots everywhere. You can use PCPartPicker and filter motherboards that have 7+ PCIe x16 slots. This gives us 21 products to choose from. We then reduce the list by selecting the minimum amount of RAM we want (e.g. 128 GB) with DDR4 / DDR5 type to bring it down to 10 products:

Motherboards with at least 7 PCIe x16 slots and 128 GB DDR4/DDR5 RAM based on PCPartPicker.

The supported CPU sockets of the above list are LGA2011-3 and LGA2066. We then move to the CPU selection and pick CPUs with the desired number of cores. These are mainly needed for data loading and batch preparation. Aim to have at least 2 cores / 4 threads per GPU. For the CPU we should also check the number of PCIe lanes it supports. Most high-end desktop and workstation CPUs of the last decade support at least 40 lanes (covering 4 GPUs at x8 lanes), but some models offer fewer, so check the spec sheet to be safe. Filtering for e.g. 16+ cores with the above sockets gives the following CPUs:

Intel Xeon E5 (LGA2011-3): 8 results
Intel Core i9 (LGA2066): 9 results
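The rules of thumb above (x8 lanes and 2 cores / 4 threads per GPU) can be turned into a quick check. This is only a sketch: the function names are made up, and the example CPU figures (40 lanes, 18 cores, 36 threads) are assumed values for illustration. Other devices such as NVMe drives also consume lanes, which is not modeled here.

```python
def cpu_requirements(num_gpus: int, lanes_per_gpu: int = 8,
                     cores_per_gpu: int = 2, threads_per_gpu: int = 4) -> dict[str, int]:
    """Minimum PCIe lanes, cores and threads for the given number of GPUs."""
    return {
        "pcie_lanes": num_gpus * lanes_per_gpu,
        "cores": num_gpus * cores_per_gpu,
        "threads": num_gpus * threads_per_gpu,
    }


def cpu_is_sufficient(cpu_lanes: int, cpu_cores: int, cpu_threads: int, num_gpus: int) -> bool:
    """True if the CPU meets all three minimums for the given number of GPUs."""
    need = cpu_requirements(num_gpus)
    return (cpu_lanes >= need["pcie_lanes"]
            and cpu_cores >= need["cores"]
            and cpu_threads >= need["threads"])


need = cpu_requirements(num_gpus=4)
ok_big = cpu_is_sufficient(cpu_lanes=40, cpu_cores=18, cpu_threads=36, num_gpus=4)
ok_small = cpu_is_sufficient(cpu_lanes=28, cpu_cores=6, cpu_threads=12, num_gpus=4)

print(need)
print(ok_big, ok_small)
```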

We then pick our favorite combination of motherboard and CPU based on the number of cores, availability and price.

Both LGA2011-3 and LGA2066 sockets are very old (2014 and 2017 respectively), and therefore you can find good deals on eBay for both the motherboard and CPU. An ASRock X99 WS-E motherboard and an 18-core Intel Xeon E5-2697 V4 can cost you less than $300 in used condition. Don’t buy the cheaper ES or QS versions of CPUs as these are engineering samples and may fail ⚠️️.

If you want to buy something more powerful and/or more recent and/or an AMD CPU you can look into motherboards with e.g. 4+ PCIe x16 slots but make sure you check the slot spacings.

At this stage it’s a good idea to start a PCPartPicker build. 🛠️
PCPartPicker will check compatibilities between components for you and will make your life easier.

Step 3. RAM 🐏

Here the most important aspect is the amount of RAM. RAM is used in different places of the deep learning cycle: loading data from disk for batch creation, loading the model and of course prototyping. The amount needed depends a lot on your application (e.g. 3D image data will need much more additional RAM) but you should aim for 1x–2x the total amount of VRAM of your GPUs. The type should be at least DDR4 but the RAM clock is not very important, so don’t spend your money there 🕳️.
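As a quick sketch of the 1x–2x rule, the snippet below sums the VRAM of all cards and picks the smallest RAM configuration that covers the lower bound. The list of kit sizes is a hypothetical example. Check what your motherboard actually supports.

```python
def ram_range_gb(vram_per_gpu_gb: list[int]) -> tuple[int, int]:
    """Recommended RAM range in GB: 1x to 2x the total VRAM of all GPUs."""
    total_vram = sum(vram_per_gpu_gb)
    return total_vram, 2 * total_vram


def smallest_kit_gb(minimum_gb: int, kit_sizes_gb: list[int]) -> int:
    """Smallest available kit that meets the minimum (raises ValueError if none does)."""
    return min(size for size in kit_sizes_gb if size >= minimum_gb)


low, high = ram_range_gb([24, 24, 24, 24])        # four 24 GB cards
kit = smallest_kit_gb(low, [32, 64, 128, 256])    # hypothetical kit sizes

print(f"aim for {low}-{high} GB of RAM, e.g. a {kit} GB kit")
```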

When buying RAM you should make sure that the form factor, type, number of modules and memory per module all agree with your motherboard specs (PCPartPicker is your friend!).

Step 4. Disks

Another component that you can save on is the disks 😌. Again, the amount of disk space is important and depends on the application. You don’t necessarily need ultra-fast disks or NVMe drives as they won’t affect your deep learning performance. The data will be loaded into RAM anyway, and to avoid creating a bottleneck you can simply use more parallel CPU workers.

Step 5. Power supply (PSU) 🔌

As we saw, GPUs are power-hungry components. When setting up a multi-GPU system, the selection of the PSU becomes an important consideration. Most PSUs deliver up to 1600 W, which is in line with the power limits of US sockets. A few PSUs can deliver more than that, but finding them takes some research and they mainly target miners.

Estimated wattage provided by PCPartPicker for your builds.

To determine the wattage of your system, you can again use PCPartPicker, which computes the total wattage of your build. To this we should add at least 10% extra for peace of mind, since GPUs can have power spikes above their rated specs.
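A minimal sketch of that headroom rule, assuming a 1200 W estimate and a made-up list of PSU sizes (neither is from the article):

```python
def required_psu_w(estimated_w: float, headroom: float = 0.10) -> float:
    """Estimated system wattage plus a safety margin for GPU power spikes."""
    return estimated_w * (1.0 + headroom)


def pick_psu_w(required_w: float, available_w: list[int]) -> int:
    """Smallest PSU that covers the requirement (raises ValueError if none does)."""
    return min(size for size in available_w if size >= required_w)


required = required_psu_w(estimated_w=1200)                # hypothetical build estimate
psu = pick_psu_w(required, [850, 1000, 1200, 1600])        # hypothetical PSU sizes

print(f"need at least {required:.0f} W, pick the {psu} W unit")
```

With only 10% headroom, a PSU rated exactly at the estimated wattage is not enough, which is why the 1200 W unit is skipped here.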

An important criterion is the PSU efficiency, which is marked with the 80 PLUS rating. The supply will deliver the wattage it advertises but will lose some power in the process. 80 PLUS Bronze supplies are rated at 82% efficiency vs. e.g. a Gold that reaches 87%. If we have a system that draws 1600 W and we use it 20% of the time, we would save about $22 per year with a Gold-rated PSU, assuming a cost of $0.16 / kWh. Take that into account when comparing prices.

PSU efficiency ratings. Table from techguided.
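The yearly saving quoted above can be reproduced with a simplified estimate that treats the efficiency gap as a fraction of the system draw. This is my reconstruction of the arithmetic, not necessarily the exact calculation behind the figure. Dividing the load by each efficiency instead would give a somewhat larger number.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_saving_usd(load_w: float, eff_low: float, eff_high: float,
                      utilization: float, usd_per_kwh: float) -> float:
    """Approximate yearly saving from the more efficient PSU (simplified model)."""
    saved_w = load_w * (eff_high - eff_low)                     # power no longer lost
    saved_kwh = saved_w * HOURS_PER_YEAR * utilization / 1000   # energy saved per year
    return saved_kwh * usd_per_kwh


saving = yearly_saving_usd(load_w=1600, eff_low=0.82, eff_high=0.87,
                           utilization=0.20, usd_per_kwh=0.16)
print(f"about {saving:.0f} USD per year")
```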

When running at full load, some PSUs are noisier than others since they run their fan at high RPMs. If you are working (or sleeping!) close to your case this can be noticeable, so it’s a good idea to check the decibel rating in the manual 😵.

When selecting a supply, we need to verify that it has enough connectors for all our parts. GPUs in particular use 8-pin (or 6+2-pin) cables. One important note here is that for each power connector of the GPU we should use a separate 8-pin cable and not use multiple outputs of the same cable (daisy-chaining). 8-pin cables are generally rated for ~150 W. When a single cable feeds more than one power connector, the GPU may not get enough power and throttle.
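A small sketch for counting cables, using one separate 8-pin cable per GPU power connector. The ~150 W per cable comes from the text. The 75 W supplied through the PCIe slot and the two connectors per card are assumptions added for illustration.

```python
CABLE_W = 150      # approximate rating of one 8-pin cable, from the text
PCIE_SLOT_W = 75   # assumption: power available through the PCIe slot itself


def cables_needed(connectors_per_gpu: list[int]) -> int:
    """One separate 8-pin cable per GPU power connector (no daisy-chaining)."""
    return sum(connectors_per_gpu)


def power_available_w(connectors: int) -> int:
    """Power one card can draw from its cables plus the slot."""
    return connectors * CABLE_W + PCIE_SLOT_W


cables = cables_needed([2, 2, 2, 2])   # four cards, two 8-pin connectors each (assumed)
per_card = power_available_w(2)

print(f"separate 8-pin cables needed: {cables}")
print(f"power available per card: {per_card} W")
```

Compare the cable count against the number of PCIe power cables the PSU ships with before buying.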

Step 6. PC case

Last but not least, selecting a PC case is not trivial. GPUs can get humongous and some cases will not fit them. A 4090 for instance can reach 36 cm length 👻!

On top of that, mounting GPUs with PCIe risers may require some hacks. There are some newer cases that allow you to mount an additional card, especially dual-system cases like the Phanteks Enthoo 719. Another option is the Lian-Li O11D EVO, which can house a GPU in an upright position with the Lian-Li Upright GPU Bracket. I don’t have these cases so I’m not sure how well they would fit e.g. multiple 3090s / 4090s. However, with the Lian-Li bracket you can still mount a GPU upright even if your PC case doesn’t directly support it. You will need to drill 2–3 holes in the case, but it is not too hard (guide to follow!).

Mounting a Titan Xp in an upright position with the Lian Li upright bracket.

The end

I hope you enjoyed reading this guide and that you found some useful tips. The guide aims to help with your research on building a multi-GPU system, not to replace it. Feel free to send me any questions or comments you may have. If I am wrong about anything above, I would really appreciate a comment or DM to make it even better 🙏!

Note: Unless otherwise noted, all images are by the author. I have included some affiliate Amazon links. Buying an item through the links comes at no extra cost and I could potentially receive a small commission.

How to Build a Multi-GPU System for Deep Learning in 2023 was originally published in Towards Data Science on Medium.
