Granite Rapids versus EPYC Turin for GPU server head nodes.

May 23, 2026 | 11 min read | Rillor Research

In an 8-GPU Blackwell node, the GPUs are the asset and the host CPU is the plumbing. That framing is correct, and it is also where most node specifications go quietly wrong. The accelerators do the training. The CPU stages the data that keeps them fed, runs the dataloader, drives checkpoint I/O, and owns the PCIe and memory fabric that everything else hangs off. Choose the wrong head node and you will not see a lower benchmark number on the GPUs. You will see a step-loading curve that never quite saturates, a checkpoint window that runs long, and a preprocessing stage that becomes the thing your schedule waits on.

This is a focused comparison of the two host CPUs you will actually be choosing between for a Blackwell HGX or DGX node: Intel's Xeon 6 P-core line built on Granite Rapids, and AMD's EPYC Turin (Zen 5) line. We are not comparing them as standalone server CPUs. We are comparing them in the one role that matters here, as the dual-socket host inside a complete OEM GPU system, where the question is not raw throughput but how well the CPU feeds eight Blackwell GPUs without becoming the bottleneck.

The two contenders, in the role that matters

Start with the parts, because the spec sheets settle several arguments before they begin.

On the Intel side, the flagship is the Xeon 6980P, a Granite Rapids-AP part with 128 P-cores and 256 threads, 504 MB of L3 cache, 12 DDR5 memory channels supporting DDR5-6400 or MRDIMM-8800, and 96 PCIe Gen5 lanes per socket at a 500W TDP. In a dual-socket node that is 192 PCIe Gen5 lanes and 24 memory channels of host fabric. That is the high end. The Xeon 6 line scales down from there, and the variants that actually ship in GPU nodes are usually not the 128-core flagship.

On the AMD side, Turin tops out for this role at the EPYC 9655: 96 cores, 192 threads, 384 MB of L3, 12-channel DDR5-6400 rated at 614 GB/s of bandwidth per socket, 128 PCIe 5.0 lanes, and a 400W default TDP on socket SP5. Its smaller sibling, the EPYC 9555, drops to 64 cores and 256 MB of L3 but carries the identical 12-channel DDR5-6400 memory and 128-lane PCIe 5.0 budget. That last point is the one to internalize about Turin: the platform-level resources (memory channels, lanes) are constant across the line, and core count is the variable. You do not give up lanes or channels to buy fewer cores.

| Spec | Xeon 6980P (Granite Rapids-AP) | EPYC 9655 (Turin) | EPYC 9555 (Turin) |
| --- | --- | --- | --- |
| Cores / threads | 128 / 256 | 96 / 192 | 64 / 128 |
| L3 cache | 504 MB | 384 MB | 256 MB |
| Memory channels | 12, DDR5-6400 / MRDIMM-8800 | 12, DDR5-6400 | 12, DDR5-6400 |
| PCIe Gen5 lanes / socket | 96 | 128 | 128 |
| TDP | 500W | 400W | 400W |

Two things jump out. AMD wins on raw per-socket lane count, 128 versus 96. Intel wins on cache and on a memory option AMD does not match in this generation, MRDIMM-8800, which lifts per-channel bandwidth meaningfully when populated. Neither of those headline numbers tells you which CPU feeds eight GPUs better, because in a real node neither the lanes nor the channels are consumed the way the spec sheet implies. That is the part worth slowing down on.

What 8-GPU nodes actually ship with

Before reasoning about lanes, it helps to look at what shipping systems actually pair, because OEMs have already made these calls and they are instructive.

NVIDIA's own DGX B200 reference design uses two Intel Xeon Platinum 8570 host CPUs, 56 cores each, alongside 8 B200 GPUs totaling 1,440 GB of HBM3e, 8 ConnectX-7 ports plus 2 BlueField-3 DPUs, and a split NVMe layout of 2x 1.92 TB M.2 for the OS and 8x 3.84 TB U.2 for the data cache. System memory ships at 2 TB and expands to 4 TB. Note what the head node is: a mainstream dual-socket Xeon, not the 128-core flagship. The GPUs do the heavy compute; the host is sized to feed them and run the system, not to win a CPU benchmark.

The newer DGX B300 moves to the Intel Xeon 6776P as host CPU, a 64 P-core part with 8 memory channels, 88 PCIe 5.0 lanes, 336 MB of L3, and a 350W TDP. Intel pairs it with Priority Core Turbo and Speed Select Technology Turbo Frequency, per-core frequency controls marketed specifically to lift GPU performance on demanding AI workloads by letting the OS pin and boost the cores that are actually driving the accelerators. That is a useful tell about what the host CPU is for in this role. It is a data-movement and orchestration engine, and the feature Intel chose to advertise for it is about steering frequency to the threads feeding GPUs, not about aggregate core throughput.

On the merchant side, a shipping 8-GPU HGX B200 server such as ASRock Rack's 8U8X-GNR2 pairs dual Xeon 6 host CPUs with 8-channel 2DPC memory (32 DIMMs, roughly 2 TB), routes the GPUs and NICs through a dedicated PCIe switch board over MCIO cabling, uses NVIDIA ConnectX-7 (or BlueField-3) networking, and provides 10 NVMe drives, 8 for GPU data and 2 for boot. That topology is the single most important fact in this whole comparison, and it is the reason the per-socket lane count on the spec sheet is not the number you should be optimizing.

The lane budget, and why the switch board changes the math

Add up what an 8-GPU node has to connect: 8 Blackwell GPUs, each wanting a x16 Gen5 link to the host fabric, 8 ConnectX-7 or ConnectX-8 NICs at x16, one or two BlueField-3 DPUs, and a stack of NVMe (10 drives in the ASRock design, the DGX B200 split between M.2 boot and U.2 data cache). Naively, 8 GPUs at x16 alone is 128 lanes, before a single NIC or drive. No single CPU exposes that directly to the GPUs and still has lanes left for everything else.

It does not need to. In every shipping 8-GPU design, the GPUs and the network NICs hang off a PCIe switch board, not off the CPU's lanes one-to-one. The CPU provides uplinks into that switch complex, and the switch fans those uplinks out to the accelerators and NICs with the topology the design wants (typically GPU-NIC pairs sharing a switch so RDMA traffic between a GPU and its NIC never has to traverse the CPU at all). The MCIO cabling in the ASRock design is exactly this: the host board feeds the switch, the switch feeds the GPUs.

The practical consequence is that the host CPU's total lane count needs to cover the uplinks to the switch fabric, the management and storage paths, and the BlueField DPUs, not a direct x16 to every GPU. Both candidates clear this comfortably. AMD's 128 lanes per socket (256 per dual-socket node on the spec sheet; two-socket EPYC boards typically give some of those lanes over to the socket-to-socket links, so check the usable count for the specific platform) can give more direct headroom and more flexibility for designs that want to attach more storage or additional NICs straight to the CPU. Intel's 96 lanes per socket (192 per node) on Granite Rapids-AP is still ample for the standard 8-GPU topology, and the DGX B300's Xeon 6776P does the job with just 88 lanes because the switch board absorbs the fan-out. If your design is lane-hungry beyond the reference topology (extra E1.S bays, a second storage tier, more than the standard NIC count attached directly), AMD's lane budget is the cleaner starting point. For the standard reference build, both are fine, and this is rarely the deciding factor. The fabric side of this (ConnectX-7 versus ConnectX-8, NVLink5) sits a layer above the host CPU, and the full catalog on the SKU index shows how each NIC and switch option is captured in the contracted configuration.
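The arithmetic behind the lane budget is easy to check. The sketch below totals the lanes a node would need if every device attached directly to the CPUs, then compares that with the spec-sheet lane counts of two sockets. Device counts and the x16 GPU and NIC widths come from the discussion above; the x16 width for the DPUs, the x4 width per NVMe drive, and the function names are assumptions for illustration.

```python
# Naive PCIe lane demand if every device attached directly to CPU lanes.
# Counts and the x16 GPU/NIC widths are from the article; the DPU and NVMe
# widths are assumptions for illustration.
DEVICES = {
    "gpu": (8, 16),    # 8 Blackwell GPUs at x16
    "nic": (8, 16),    # 8 ConnectX NICs at x16
    "dpu": (2, 16),    # 2 BlueField-3 DPUs, x16 assumed
    "nvme": (10, 4),   # 10 NVMe drives, x4 assumed
}

# Spec-sheet PCIe Gen5 lanes per socket.
LANES_PER_SOCKET = {"Xeon 6980P": 96, "EPYC 9655": 128}


def naive_lane_demand(devices):
    """Sum count * link width over all device classes."""
    return sum(count * width for count, width in devices.values())


def shortfall(demand, lanes_per_socket, sockets=2):
    """Lanes missing if the host had to serve the naive demand directly."""
    return demand - lanes_per_socket * sockets


demand = naive_lane_demand(DEVICES)
for cpu, lanes in LANES_PER_SOCKET.items():
    print(f"{cpu}: demand {demand}, host {lanes * 2}, short by {shortfall(demand, lanes)}")
```

Under these assumptions the naive demand is 328 lanes, well past either host's spec-sheet total. That gap is why the fan-out lives on the switch board and the CPU only has to cover uplinks, storage, and DPUs.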

192 vs 256: node PCIe Gen5 lanes, Intel vs AMD
12-ch: DDR5 channels per socket, both
~2 TB: typical 8-channel B200 node memory

Memory channels set the ceiling for the working set

Memory is where the choice gets more interesting, because it directly sizes the host-side working set that preprocessing and the dataloader run inside.

Channel count caps two things: capacity and host memory bandwidth. An 8-channel host (the configuration in many shipping B200 nodes, and on the Xeon 6776P in the DGX B300) lands near 2 TB of system memory in a sensible 2DPC population, expandable toward 4 TB with the densest DIMMs. A 12-channel host, which both Granite Rapids-AP and the full Turin line provide, lifts that ceiling: 12 channels with high-capacity RDIMMs reach roughly 6 TB of system memory per node and add proportionally more host memory bandwidth. AMD rates the EPYC 9655 at 614 GB/s per socket on DDR5-6400. Granite Rapids-AP matches the channel count and, with MRDIMM-8800 populated, pushes per-channel bandwidth higher than DDR5-6400 can, which is Intel's distinct memory-bandwidth advantage in this generation.
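The bandwidth and capacity figures follow from simple arithmetic, sketched below. Peak per-socket bandwidth is taken as channels times transfer rate times 8 bytes per transfer (a 64-bit data path per channel), which is a theoretical ceiling and not a measured number. The DIMM sizes and DIMMs-per-channel in the capacity cases are assumptions chosen to reproduce the rough 2 TB, 4 TB, and 6 TB figures, not a statement of what any particular OEM populates.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak per-socket memory bandwidth in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def node_capacity_tb(channels, dimms_per_channel, dimm_gb, sockets=2):
    """Total system memory for the node in TB (1 TB = 1024 GB)."""
    return channels * dimms_per_channel * dimm_gb * sockets / 1024


# Bandwidth ceilings per socket.
print("12-ch DDR5-6400  :", peak_bandwidth_gbs(12, 6400), "GB/s")
print("12-ch MRDIMM-8800:", peak_bandwidth_gbs(12, 8800), "GB/s")
print(" 8-ch DDR5-6400  :", peak_bandwidth_gbs(8, 6400), "GB/s")

# Capacity cases; DIMM sizes and DPC are illustrative assumptions.
print(" 8-ch, 2DPC,  64 GB DIMMs:", node_capacity_tb(8, 2, 64), "TB")
print(" 8-ch, 2DPC, 128 GB DIMMs:", node_capacity_tb(8, 2, 128), "TB")
print("12-ch, 1DPC, 256 GB DIMMs:", node_capacity_tb(12, 1, 256), "TB")
```

The 12-channel DDR5-6400 case works out to 614.4 GB/s, in line with the 614 GB/s AMD rates for the EPYC 9655.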

When does that ceiling matter? Not for the GPU compute, which feeds from HBM3e on the accelerators (1,440 GB across the eight B200s in a node). It matters for the host-resident working set: the dataset shards staged in page cache, the preprocessing buffers, the in-flight batches the dataloader is assembling, and the checkpoint state being marshaled to and from storage. A training job with heavy CPU-side augmentation, large tokenized shards held in cache, or a fat checkpoint footprint will use every gigabyte of host memory you give it, and will run the dataloader faster with more host memory bandwidth behind it. If your pipeline is that shape, a 12-channel host with a large memory configuration is worth specifying, and it is a reason to reach for the full Turin line or Granite Rapids-AP over an 8-channel head node. If your pipeline streams pre-tokenized data with light CPU work, an 8-channel 2 TB node is not the thing holding you back.

Where the CPU choice actually shows up

Strip away the spec-sheet duel and the host CPU influences a Blackwell node in exactly three places. Everything else is the same regardless of which logo is on the socket.

Data preprocessing

CPU-side augmentation, decoding, tokenization, and any non-trivial transform that runs before a batch reaches the GPU is pure host work. This is where core count and per-core frequency earn their keep. A pipeline that does heavy image decode-and-augment per sample, or runs a tokenizer on the fly, can starve the GPUs if the host cannot produce batches fast enough. Here the 96-core EPYC 9655 and the high-core Xeon 6 parts both bring real throughput, and Intel's Priority Core Turbo and SST-TF are a deliberate answer to this exact problem: pin and boost the cores doing the feeding. If your preprocessing is light or fully offloaded to the GPU, this dimension collapses and core count stops mattering. Know your pipeline before you pay for cores.
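One way to "know your pipeline" is a back-of-envelope core count. The sketch below estimates how many host cores preprocessing needs to keep eight GPUs fed, given a per-GPU consumption rate and a per-core preprocessing rate. Every number in it is a hypothetical placeholder to be replaced with your own measurements, and the 25% headroom factor is an assumption, not a recommendation from any vendor.

```python
import math


def cores_needed(num_gpus, samples_per_gpu_per_s, samples_per_core_per_s, headroom=1.25):
    """Host cores required so preprocessing keeps pace with GPU consumption."""
    required_rate = num_gpus * samples_per_gpu_per_s * headroom
    return math.ceil(required_rate / samples_per_core_per_s)


# Hypothetical heavy pipeline: decode-and-augment at 400 samples/s per core.
heavy = cores_needed(num_gpus=8, samples_per_gpu_per_s=3000, samples_per_core_per_s=400)

# Hypothetical light pipeline: pre-tokenized data at 5000 samples/s per core.
light = cores_needed(num_gpus=8, samples_per_gpu_per_s=3000, samples_per_core_per_s=5000)

print(f"heavy pipeline: {heavy} cores, light pipeline: {light} cores")
```

With these placeholder rates the heavy pipeline wants 75 cores and the light one 6, which is the gap between needing a high-core part and being indifferent to core count.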

Dataloader throughput

The dataloader is the steady-state heartbeat of training, and it is bound by host memory bandwidth, page-cache capacity, and how fast worker threads can assemble and hand off batches. This is where 12 memory channels and a large host memory configuration pay back continuously, every step, for the life of the job. A node that can hold more of the dataset in page cache and feed batches with more memory bandwidth keeps GPU utilization higher with less effort. Both candidates are strong here at 12 channels; Intel's MRDIMM-8800 option is the bandwidth edge if you populate it.
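Whether the dataloader is actually the constraint can be measured before buying memory channels. The sketch below is a minimal, framework-agnostic timer that splits each training step into time spent waiting for the next batch and time spent in the step itself. The function name and the injectable clock are illustrative choices. It assumes `step_fn` blocks until the step has finished; with asynchronous GPU execution you would need to synchronize inside `step_fn` for the split to be meaningful.

```python
import time


def data_wait_fraction(batches, step_fn, clock=time.perf_counter):
    """Return the fraction of wall time spent waiting on the next batch."""
    wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(iterator)   # time here is host-side loading
        except StopIteration:
            break
        t1 = clock()
        step_fn(batch)               # time here is the training step
        t2 = clock()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0


if __name__ == "__main__":
    # Toy usage: a generator that sleeps to imitate a slow loader.
    def slow_loader(n):
        for i in range(n):
            time.sleep(0.01)
            yield i

    frac = data_wait_fraction(slow_loader(5), lambda b: time.sleep(0.03))
    print(f"fraction of time waiting on data: {frac:.2f}")
```

A wait fraction near zero suggests the host is keeping up; a persistently large one suggests the dataloader side (memory bandwidth, page cache, or worker count) is worth attention.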

Checkpoint I/O

Checkpointing is a burst: the job pauses, marshals state, and writes it through the host to NVMe (the 8 U.2 data drives in the DGX B200, the 8 GPU-data NVMe in the ASRock design). A wide host with ample lanes to storage and enough memory bandwidth to stage the write shortens that window, which matters because every second of checkpoint is a second the GPUs are idle. This is where AMD's larger lane budget can help a storage-heavy design and where host memory bandwidth helps both. None of this changes the GPU compute number. It changes how much of that compute you actually capture.
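The cost of that burst can be estimated with simple division. The sketch below computes a checkpoint window from checkpoint size and aggregate write bandwidth, optionally capped by how fast the host can stage data, and converts it to idle GPU-seconds. It assumes a fully synchronous checkpoint striped evenly across the drives; the checkpoint size, per-drive write rate, and staging cap are hypothetical placeholders, not figures from any of the systems above.

```python
def checkpoint_window_s(checkpoint_gb, drives, per_drive_write_gbs, host_staging_gbs=None):
    """Seconds to write a checkpoint, limited by storage or host staging rate."""
    storage_gbs = drives * per_drive_write_gbs
    if host_staging_gbs is None:
        effective_gbs = storage_gbs
    else:
        effective_gbs = min(storage_gbs, host_staging_gbs)
    return checkpoint_gb / effective_gbs


def idle_gpu_seconds(window_s, num_gpus=8):
    """GPU-seconds lost while the job is paused for the checkpoint."""
    return window_s * num_gpus


# Hypothetical: 1000 GB checkpoint, 8 data drives at 5 GB/s sustained write.
storage_bound = checkpoint_window_s(1000, drives=8, per_drive_write_gbs=5)

# Same checkpoint, but the host can only stage 20 GB/s toward storage.
host_bound = checkpoint_window_s(1000, drives=8, per_drive_write_gbs=5, host_staging_gbs=20)

print(f"storage-bound: {storage_bound:.0f} s, {idle_gpu_seconds(storage_bound):.0f} idle GPU-s")
print(f"host-bound:    {host_bound:.0f} s, {idle_gpu_seconds(host_bound):.0f} idle GPU-s")
```

In this toy case a host that halves the effective write rate doubles the window from 25 to 50 seconds, with no change to the GPUs or the drives.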

The honest verdict

For the standard 8-GPU Blackwell node, both Granite Rapids and EPYC Turin are correct choices, and the system around them matters more than the socket inside them. That is not a dodge; it is the result of the switch-board topology and the fact that both lines clear the lane and memory budget the role demands.

If you want a default, here it is. AMD EPYC Turin (the 9655 at 96 cores, or the 9555 at 64) gives you the largest per-socket lane budget, 12 channels, and a clean core-count-only decision across the line, which makes it an easy, flexible host for most builds. Intel's Xeon 6 line is the reference host for NVIDIA's own DGX B300 (the 6776P) and brings the MRDIMM-8800 memory-bandwidth option plus Priority Core Turbo, which are genuine advantages for memory-bandwidth-bound dataloaders and preprocessing-heavy pipelines. Pick AMD when lane headroom and core flexibility lead; pick Intel when host memory bandwidth and per-core boost for the feeding threads lead. The 128-core Xeon 6980P flagship is more CPU than a single GPU node needs and is better justified by node consolidation or CPU-side serving than by feeding eight GPUs.

What you should not do is treat the head node as an afterthought the channel fills in for you. The host CPU, the memory configuration, the NIC count, and the NVMe layout all ship inside the complete OEM system, and they are all specifiable. For how the rest of the node comes together at the system level, "Read this if you are procuring B200 systems in 2026" takes the full configuration from socket to switch.

Specifying the pairing on a forward contract

This is the part that connects the spec sheet to procurement. A Rillor SKU references a complete OEM GPU system, not a loose tray of accelerators, which means the host CPU pairing is part of what you are contracting for. RIL-GX-B200-2T is the standardized HGX B200 8-GPU system, and the head-node configuration (Granite Rapids versus Turin, core count, memory channels and capacity) is something you specify rather than inherit from whatever the channel happens to allocate. The same NVIDIA platform ships from Supermicro, Gigabyte, Dell, HPE, Lenovo ISG, ASRock Rack, and Aivres, so the host pairing you want is sourceable across competing OEMs at the best forward price for your delivery month. Browse the standardized contracts on the marketplace and the full catalog on the SKU index to see the configurations in practice.

Because the system is contracted forward, you lock the configuration and the delivery month together, with a 10% deposit at execution held by an independent escrow agent and the balance at delivery. The pairing you spec today is the pairing that arrives, on the date you chose, from a KYC'd seller bonded to deliver it. That is the difference between specifying a head node and hoping the one you get is the one you wanted.
