A GPU cluster has two networks, and the most common architecture mistake is treating them as one. The first is the scale-up fabric: NVLink5, which fuses a set of GPUs into a single coherent domain so they behave, for the purposes of a tensor-parallel shard, like one very large accelerator. The second is the scale-out fabric: the ConnectX cards on each baseboard that carry traffic between nodes across a switched network. They operate at different bandwidths, different latencies, and different topological scopes, and a cluster is only as good as the weaker of the two for whatever collective your job actually runs.
This explainer separates those two layers, then walks through the generational shift that matters most this cycle: B200-class HGX systems ship eight ConnectX-7 adapters at 400 Gb/s for east-west traffic, while B300-class HGX systems move to eight ConnectX-8 SuperNICs at 800 Gb/s on the baseboard. Doubling per-GPU east-west bandwidth is not a spec-sheet footnote. It changes how large a model you can train at a given step time, and it changes what you should be willing to pay for the system that carries it.
The two fabrics, defined
Scale-up is what happens inside a single coherent GPU domain. On an HGX B200 or B300 baseboard, eight GPUs are wired together with fifth-generation NVLink and an NVLink switch complex so that any GPU can read or write any other GPU's high-bandwidth memory at full fabric speed. NVLink5 delivers roughly 1.8 TB/s per GPU, which is 18 links at 100 GB/s each and on the order of fourteen times the bandwidth of a PCIe Gen5 path. At rack scale, the GB200 NVL72 extends the same NVLink5 fabric across all 72 GPUs in the rack for about 130 TB/s of aggregate low-latency GPU-to-GPU bandwidth. Inside that domain there is no Ethernet, no InfiniBand, and no NIC in the data path. The domain is the unit that tensor parallelism wants to live inside.
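The scale-up figures above follow from simple arithmetic, which a short sketch can make explicit. The link count, per-link rate, and rack size come from the text; the PCIe Gen5 x16 baseline of roughly 128 GB/s bidirectional is an assumption added here only for the comparison.

```python
# Figures quoted in the text above.
NVLINK5_LINKS_PER_GPU = 18
NVLINK5_GB_S_PER_LINK = 100   # GB/s per link
NVL72_GPUS = 72

# Assumption (not from the text): a PCIe Gen5 x16 path at roughly
# 128 GB/s bidirectional, used only as the comparison baseline.
PCIE_GEN5_X16_GB_S = 128

per_gpu_gb_s = NVLINK5_LINKS_PER_GPU * NVLINK5_GB_S_PER_LINK
rack_tb_s = per_gpu_gb_s * NVL72_GPUS / 1000
pcie_ratio = per_gpu_gb_s / PCIE_GEN5_X16_GB_S

print(f"NVLink5 per GPU:  {per_gpu_gb_s / 1000:.1f} TB/s")
print(f"NVL72 aggregate:  {rack_tb_s:.1f} TB/s")
print(f"vs PCIe Gen5 x16: {pcie_ratio:.1f}x")
```

Under that assumed baseline, the per-GPU figure works out to 1.8 TB/s, the rack aggregate to just under 130 TB/s, and the PCIe ratio to about fourteen.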
Scale-out is everything past the edge of that domain. Once a job spans more baseboards or more racks than a single NVLink domain can hold, the GPUs talk over a switched network through the ConnectX adapters. This is the plane that carries data-parallel gradient all-reduces, pipeline-parallel activations between stages, and any collective whose participants are spread across nodes. It runs at a fraction of NVLink bandwidth, it traverses switches, and it is where topology, congestion control, and adaptive routing earn their keep.
The practical rule for an architect: keep the most bandwidth-hungry parallelism dimension inside the NVLink domain, and let scale-out carry the dimensions that tolerate more latency and less bandwidth. The size of your NVLink domain (8 on an HGX baseboard, 72 on an NVL72 rack) and the speed of your ConnectX plane together set the ceiling on the model and batch you can run efficiently.
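One way to apply that rule during planning is a quick check that the tensor-parallel group fits inside a single NVLink domain. The sketch below is illustrative only: `ParallelPlan` and `tp_fits_in_domain` are hypothetical names, and requiring the domain size to be a multiple of the tensor-parallel degree is an assumption made here so that no group straddles a domain boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPlan:
    """Hypothetical description of how a job splits across GPUs."""
    tensor_parallel: int
    pipeline_parallel: int
    data_parallel: int

    @property
    def total_gpus(self) -> int:
        return self.tensor_parallel * self.pipeline_parallel * self.data_parallel


def tp_fits_in_domain(plan: ParallelPlan, nvlink_domain_size: int) -> bool:
    """True when every tensor-parallel group can sit inside one NVLink domain.

    Assumption: the domain must hold a whole number of TP groups, so that
    no group has to cross the scale-out fabric.
    """
    return (
        plan.tensor_parallel <= nvlink_domain_size
        and nvlink_domain_size % plan.tensor_parallel == 0
    )


plan = ParallelPlan(tensor_parallel=8, pipeline_parallel=4, data_parallel=16)
print(plan.total_gpus)                # GPUs the plan needs in total
print(tp_fits_in_domain(plan, 8))     # HGX baseboard domain
print(tp_fits_in_domain(plan, 72))    # NVL72 rack domain
```

A plan that fails this check pushes tensor-parallel traffic onto the ConnectX plane, which is the situation the rule is meant to prevent.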
ConnectX-7 to ConnectX-8: what doubles, and why it matters
On a B200-class HGX system, the baseboard carries eight ConnectX-7 adapters, one per GPU. Each ConnectX-7 delivers 400 Gb/s and can run either NDR InfiniBand or 400G Ethernet. That 1:1 GPU-to-NIC ratio is deliberate: it gives every GPU a dedicated rail out of the node so that scale-out collectives are not bottlenecked by a shared uplink.
B300-class HGX systems keep the 1:1 ratio and move the card generation forward. The B300 baseboard carries eight ConnectX-8 SuperNICs, each delivering 800 Gb/s, which NVIDIA's reference architecture describes as two 400 Gb/s Ethernet ports per GPU for east-west networking. The ConnectX-8 itself supports up to 800 Gb/s total, runs InfiniBand at 800/400/200/100 Gb/s or Ethernet at 400/200/100/50/25 Gb/s, and sits on a PCIe Gen6 host interface so the card is not throttled by the bus behind it.
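To put the two card generations on the same scale as NVLink, the sketch below converts the per-NIC rates into per-GPU GB/s and per-node Tb/s. The NIC rates and the 1.8 TB/s NVLink5 figure are from the text; the comparison is deliberately rough, since it ignores protocol overhead and the difference between per-direction and bidirectional figures.

```python
NICS_PER_NODE = 8                    # one NIC per GPU, per the text
NIC_GBPS = {"ConnectX-7": 400, "ConnectX-8": 800}
NVLINK5_GB_S = 1800                  # per-GPU NVLink5 bandwidth from the text


def gbps_to_gb_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


summary = {}
for name, gbps in NIC_GBPS.items():
    per_gpu_gb_s = gbps_to_gb_s(gbps)
    node_tbps = NICS_PER_NODE * gbps / 1000
    nvlink_ratio = NVLINK5_GB_S / per_gpu_gb_s
    summary[name] = (per_gpu_gb_s, node_tbps, nvlink_ratio)
    print(f"{name}: {per_gpu_gb_s:.0f} GB/s per GPU, "
          f"{node_tbps:.1f} Tb/s per node, NVLink is ~{nvlink_ratio:.0f}x faster")
```

Even at 800 Gb/s, the east-west rail remains well over an order of magnitude slower than the NVLink path by this rough measure.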
Why does doubling east-west bandwidth matter when NVLink is already an order of magnitude faster than either NIC generation? Because the scale-out plane is what gates large-cluster training. When a job spans thousands of GPUs, the all-reduce that synchronizes gradients every step has to move a volume of data proportional to the model size, and it has to do it across the ConnectX fabric, not across NVLink. The time that collective takes is set by the slowest link each byte must cross. Halving the time a node spends pushing gradients out and pulling them back in raises the fraction of each step spent on useful compute rather than communication. For a training run measured in weeks, the difference between a 400 Gb/s and an 800 Gb/s east-west plane is a difference in delivered tokens per dollar, not a benchmark curiosity.
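A back-of-the-envelope model shows the effect. The sketch below uses the standard bandwidth-only estimate for a ring all-reduce, in which each participant moves 2(N-1)/N times the payload over its own link. It is a simplification under stated assumptions, not a prediction: it ignores latency, compute overlap, sharding, and any in-network reduction, and the 70-billion-parameter, 1024-GPU job is hypothetical.

```python
def ring_allreduce_seconds(payload_bytes: float, participants: int,
                           link_gbps: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    Each participant sends and receives 2 * (N - 1) / N times the payload,
    so the time is that traffic divided by the per-GPU link rate.
    """
    if participants < 2:
        return 0.0
    link_bytes_per_s = link_gbps * 1e9 / 8
    traffic_bytes = 2 * (participants - 1) / participants * payload_bytes
    return traffic_bytes / link_bytes_per_s


# Hypothetical job: 70e9 parameters with 2-byte gradients across 1024 GPUs.
PAYLOAD_BYTES = 70e9 * 2
GPUS = 1024

t_cx7 = ring_allreduce_seconds(PAYLOAD_BYTES, GPUS, 400)   # ConnectX-7 rail
t_cx8 = ring_allreduce_seconds(PAYLOAD_BYTES, GPUS, 800)   # ConnectX-8 rail

print(f"400 Gb/s rail: {t_cx7:.2f} s per all-reduce")
print(f"800 Gb/s rail: {t_cx8:.2f} s per all-reduce")
print(f"speedup: {t_cx7 / t_cx8:.1f}x")
```

In this model the communication time scales inversely with link rate, so doubling the rail halves the bandwidth-bound portion of each step. Real jobs will see less than that to the extent communication already overlaps with compute.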
The corollary is that the upgrade only pays off if the rest of the path keeps up. An 800 Gb/s NIC into a 400 Gb/s switch port buys you nothing. The B300 fabric is a system property: ConnectX-8 on the baseboard, 800G-capable switch ports, and cabling rated for the rate. This is exactly the kind of end-to-end consistency that is easy to get wrong when components are sourced piecemeal, and we come back to it below.
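The point reduces to taking a minimum over every hop on the path. The helper below is a minimal sketch with hypothetical hop names; the rates are the ones discussed in the text.

```python
def effective_gbps(path: dict[str, int]) -> tuple[str, int]:
    """Return the slowest hop on the path and its rate in Gb/s."""
    hop = min(path, key=path.get)
    return hop, path[hop]


# Hypothetical end-to-end paths for one GPU rail, rates in Gb/s.
matched = {"nic": 800, "cable": 800, "switch_port": 800}
mismatched = {"nic": 800, "cable": 800, "switch_port": 400}

print(effective_gbps(matched))
print(effective_gbps(mismatched))
```

In the mismatched case the switch port is reported as the limiting hop at 400 Gb/s, which is the "buys you nothing" outcome described above.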
InfiniBand or Ethernet: the card runs either, the switch decides
A frequent point of confusion: ConnectX-7 and ConnectX-8 are not InfiniBand cards or Ethernet cards. They are both. The same adapter runs either fabric, and the decision is made by what you cable it into and how you configure the port. That makes the switch layer, not the NIC, the place where you commit.
Run the cards in InfiniBand mode and you pair them with NVIDIA Quantum-2, which offers 64 ports of 400 Gb/s NDR (or 128 ports at 200 Gb/s) for 51.2 Tb/s of bidirectional aggregate throughput per switch, with the in-network computing and adaptive routing that InfiniBand has long been built around. Note that those ports top out at 400 Gb/s, so a ConnectX-8 cabled into Quantum-2 runs at the NDR rate, not at 800 Gb/s. Run them in Ethernet mode and you pair them with NVIDIA Spectrum-X, the Ethernet platform engineered specifically for the loss patterns and congestion of giga-scale AI traffic. There, ConnectX SuperNICs and Spectrum switches are designed to behave as one congestion-managed system rather than a generic NIC bolted onto a generic switch.
| Fabric mode | Card | Switch platform | Per-switch throughput |
|---|---|---|---|
| InfiniBand | ConnectX-7 / ConnectX-8 | Quantum-2 (QM9700) | 51.2 Tb/s bidirectional, 64x 400 Gb/s NDR |
| Ethernet | ConnectX-7 / ConnectX-8 | Spectrum-X (SN5600) | Not quoted here; positioned for giga-scale AI Ethernet |
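The Quantum-2 figure can be reproduced from its port counts, as the sketch below shows. The port counts and rates are from the text. The nodes-per-rail estimate is an assumption added for illustration: it supposes a non-blocking leaf with half of its ports facing nodes and half facing the spine.

```python
def bidirectional_tbps(ports: int, port_gbps: int) -> float:
    """Aggregate switch throughput, counting both directions of every port."""
    return ports * port_gbps * 2 / 1000


ndr_tbps = bidirectional_tbps(64, 400)       # 64 ports of 400 Gb/s NDR
ndr200_tbps = bidirectional_tbps(128, 200)   # 128 ports of 200 Gb/s

# Assumption (not from the text): a non-blocking leaf splits its ports
# evenly between downlinks to nodes and uplinks to the spine.
PORTS_PER_LEAF = 64
nodes_per_rail_leaf = PORTS_PER_LEAF // 2

print(f"64 x 400 Gb/s:  {ndr_tbps:.1f} Tb/s bidirectional")
print(f"128 x 200 Gb/s: {ndr200_tbps:.1f} Tb/s bidirectional")
print(f"nodes per rail leaf (assumed 1:1 split): {nodes_per_rail_leaf}")
```

Both port configurations land on the same aggregate, which is why the datasheet quotes a single throughput number for the switch.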
The choice is consequential and hard to reverse once cabling, optics, and switch firmware are in place. InfiniBand still tends to win on raw collective latency and on the maturity of its in-network reduction; Ethernet wins on operational familiarity, multi-vendor optics, and convergence with the rest of a datacenter network. Either way, the ConnectX generation you receive does not lock you into a fabric, but it does set the speed ceiling, and an 800 Gb/s SuperNIC is wasted if you cable it into a 400 Gb/s switch fabric in either mode. Plan the switch tier and the NIC generation together, because the slower of the two is the one your collectives will feel.
BlueField-3: the north-south and offload plane
The ConnectX cards carry GPU-to-GPU east-west traffic. They are not the whole network. North-south traffic (storage access, management, ingress and egress to the rest of the datacenter) and a growing set of infrastructure offloads ride on a separate device: the BlueField-3 DPU.
B200-class systems typically pair with the BlueField-3 B3220, which provides two ports of 200 Gb/s Ethernet or NDR200 InfiniBand. B300-class systems step up to the B3240, with two ports of 400 Gb/s Ethernet or NDR InfiniBand. Both carry sixteen Arm cores and 32 GB of DDR5, which is what lets them run storage initiators, encryption, telemetry, and software-defined networking off the host CPU and off the GPUs. Keeping that plane distinct from the GPU east-west rails is the point: you do not want a storage burst or a control-plane event stealing bandwidth from a gradient all-reduce. When you spec a node, account for the DPU generation alongside the NIC generation, because they move together across the B200-to-B300 transition and a build that mixes them is a build that was assembled, not delivered as a system.
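The claim that the DPU and NIC generations move together can be checked with the port figures above. This sketch compares each system class's north-south DPU bandwidth with its east-west NIC bandwidth. The dictionary layout is hypothetical, and the rates are those quoted in the text.

```python
# (ports, Gb/s per port) for each BlueField-3 model named in the text.
DPU_PORTS = {"B3220": (2, 200), "B3240": (2, 400)}

# System class -> (DPU model, NICs per node, Gb/s per NIC), per the text.
SYSTEMS = {
    "B200": ("B3220", 8, 400),
    "B300": ("B3240", 8, 800),
}

ratios = {}
for system, (dpu, nics, nic_gbps) in SYSTEMS.items():
    ports, port_gbps = DPU_PORTS[dpu]
    north_south = ports * port_gbps      # DPU plane, Gb/s
    east_west = nics * nic_gbps          # ConnectX plane, Gb/s
    ratios[system] = east_west / north_south
    print(f"{system}: east-west {east_west} Gb/s, "
          f"north-south {north_south} Gb/s ({dpu}), "
          f"ratio {ratios[system]:.0f}:1")
```

With matched generations the east-west to north-south ratio stays the same across the transition. A build that pairs a B300 baseboard with the older DPU would shift that ratio, which is one quick signal of a mixed configuration.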
Rail-optimized topology in one pass
Given one ConnectX per GPU, the dominant scale-out layout for these systems is rail-optimized. The idea is simple to state. Number the NICs on each node 0 through 7. Connect every node's NIC-0 to the same leaf switch (call it L0), every NIC-1 to a second leaf (L1), and so on, so that the eight NICs on a node fan out to eight separate leaves. Each of those leaf switches is a rail.
The payoff is that any two GPUs occupying the same position on their respective nodes (every GPU-3, say) can reach each other across a single switch hop, because they share a rail. Collectives that map cleanly onto rails, which most well-tuned all-reduce and all-to-all implementations do, then complete most of their traffic without ever climbing to the spine. Cross-rail traffic still traverses the spine, but the design pushes the heaviest, most regular communication onto single-hop paths. Rail-optimized topology is why the 1:1 GPU-to-NIC ratio on the baseboard is not redundancy for its own sake: it is the precondition that lets each GPU own a dedicated rail. Step up from ConnectX-7 to ConnectX-8 and every rail in the cluster doubles in width at once, which is why the generation printed on the baseboard, not the GPU model alone, is what an architect has to read before sizing the spine.
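The rail mapping is simple enough to express directly. The sketch below is a toy model under stated assumptions: one leaf per rail as described above, a single spine tier so that cross-rail traffic goes leaf, spine, leaf, and traffic within a node staying on NVLink. The function names are hypothetical.

```python
NICS_PER_NODE = 8


def rail_leaf(nic_index: int) -> str:
    """Leaf switch serving a given NIC position: NIC-k on every node goes to Lk."""
    if not 0 <= nic_index < NICS_PER_NODE:
        raise ValueError(f"NIC index must be 0..{NICS_PER_NODE - 1}")
    return f"L{nic_index}"


def switch_hops(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Scale-out switch hops between two (node, nic) endpoints.

    Assumptions: same-node traffic stays on NVLink (0 switch hops);
    same-rail traffic crosses one leaf; cross-rail traffic goes
    leaf -> spine -> leaf (3 switch hops).
    """
    src_node, src_nic = src
    dst_node, dst_nic = dst
    if src_node == dst_node:
        return 0
    if rail_leaf(src_nic) == rail_leaf(dst_nic):
        return 1
    return 3


print(switch_hops((0, 3), (5, 3)))   # GPU-3 on node 0 to GPU-3 on node 5
print(switch_hops((0, 3), (5, 4)))   # GPU-3 on node 0 to GPU-4 on node 5
print(switch_hops((2, 0), (2, 7)))   # two GPUs on the same node
```

In this model, same-position GPUs on different nodes are always a single hop apart, which is the property that rail-aligned collectives exploit.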
Why the system contract pins the fabric
Here is where networking architecture meets procurement, and why it matters who you buy from and how. The NIC generation on these systems is not a configurable accessory you bolt on later. ConnectX-7 is integrated onto the B200 HGX baseboard; ConnectX-8 is integrated onto the B300 HGX baseboard. The same is true of the BlueField-3 generation that ships with each. When you buy the system, you buy the fabric. The OEM SKU encodes it: a Supermicro SYS-A22GA-NBRT, a Gigabyte G894-AD1-AAX5, a Dell PowerEdge XE9780, an HPE ProLiant Compute XD685, a Lenovo SR680a V3, each carries a specific baseboard with a specific NIC generation and a specific DPU.
That is exactly why Rillor writes forward contracts on complete OEM systems rather than on loose GPUs. A contract for "eight Blackwell GPUs" tells you nothing about whether you are getting 400 Gb/s or 800 Gb/s of east-west bandwidth per GPU, whether the DPU is a B3220 or a B3240, or whether the baseboard will accept the switch fabric you have already deployed. A Rillor forward contract references the exact OEM system SKU, which pins the NIC generation, the DPU, the CPU, and the interconnect along with it. You can read the standardized terms in the anatomy of a Rillor forward contract, and you can see the live catalog of system-level SKUs on the marketplace and the SKU index. The contractual unit and the architectural unit are the same unit, which is the only way to guarantee that the cluster you designed is the cluster that lands on your dock.
The mismatched-fabric build is a real failure mode, not a hypothetical. It looks like a rack of B300 systems delivered against a switch fabric specced for 400 Gb/s, or a cluster where two procurement waves arrived a generation apart and now run rails at two different widths. Because a synchronous collective runs at the pace of its slowest rail, both can cut your effective east-west bandwidth to the 400 Gb/s rate, half of what the B300 systems were bought to deliver, in the worst place, the place that gates large-cluster training. Both are also expensive to unwind after optics and cabling are committed. Capturing the exact NIC generation in the contract at execution, with the OEM system SKU as the reference, is how you avoid discovering the mismatch at install. For buyers planning a multi-rack buildout, locking the system spec forward is the same discipline that lets you lock capacity before you need it rather than negotiating each generation transition under deadline.
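Both failure modes can be caught on paper before anything ships. The sketch below audits a node inventory against a planned switch port rate. It is illustrative only: the inventory format, node names, and `audit_fabric` are hypothetical, and it assumes every rail runs at the rate of the slowest NIC or switch port it touches.

```python
from collections import Counter


def audit_fabric(nodes: list[dict], switch_port_gbps: int) -> tuple[int, list[str]]:
    """Return the effective rail rate in Gb/s and a list of findings."""
    findings = []
    generations = Counter(node["nic"] for node in nodes)
    if len(generations) > 1:
        findings.append(f"mixed NIC generations: {dict(generations)}")
    # Assumption: the collective runs at the slowest NIC or switch port.
    slowest_nic = min(node["nic_gbps"] for node in nodes)
    effective = min(slowest_nic, switch_port_gbps)
    for node in nodes:
        if node["nic_gbps"] > effective:
            findings.append(
                f"{node['name']}: {node['nic_gbps']} Gb/s NIC limited to {effective} Gb/s"
            )
    return effective, findings


# Hypothetical inventory: two procurement waves a generation apart.
two_waves = [
    {"name": "wave1-node", "nic": "ConnectX-7", "nic_gbps": 400},
    {"name": "wave2-node", "nic": "ConnectX-8", "nic_gbps": 800},
]
uniform_b300 = [
    {"name": "node-a", "nic": "ConnectX-8", "nic_gbps": 800},
    {"name": "node-b", "nic": "ConnectX-8", "nic_gbps": 800},
]

print(audit_fabric(two_waves, switch_port_gbps=800))
print(audit_fabric(uniform_b300, switch_port_gbps=400))
print(audit_fabric(uniform_b300, switch_port_gbps=800))
```

Only the last case, uniform systems on a matching switch tier, comes back with no findings. Pinning the OEM system SKU in the contract is what makes the inventory side of this check knowable in advance.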
- Components, NVIDIA HGX AI Factory (Enterprise Reference Architectures)
- NVIDIA ConnectX-8 SuperNIC Specifications, NVIDIA Docs
- NVIDIA ConnectX-7 NDR 400G InfiniBand Adapter Card Datasheet (PDF)
- GB200 NVL72, NVIDIA
- NVLink and scale-up networking, Introl Blog
- NVIDIA BlueField-3 DPU Datasheet (PDF)
- NVIDIA Quantum-2 InfiniBand Platform, NVIDIA
- NVIDIA Spectrum-X Ethernet Platform for Giga-Scale AI, NVIDIA
- GPU Cluster Network Topology Design (fat-tree, dragonfly, rail-optimized), Introl Blog