HBM3e capacity and bandwidth across the Blackwell line.

When engineers size a model to hardware, the first instinct is to reach for FLOPS. It is the wrong number to lead with. For most production training and inference workloads, the binding constraint is not how fast a node can multiply matrices but how many bytes of parameters, activations, and KV cache you can hold in high-bandwidth memory at once. Compute throughput sets how long a step takes. Memory capacity sets whether the step is possible at all. Cross the capacity line and your options collapse to a short list: shard across more nodes, offload to slower memory, or shrink the workload. None of those is free, and two of them are slow.

That is why a memory-centric read of the NVIDIA data-center line is more useful for capacity planning than a throughput-centric one. The HBM3e generation, from the H200 through the Blackwell B200 and B300, has moved per-GPU capacity faster than it has moved peak compute, and the per-node aggregate is the figure that actually governs which models fit. This piece lays out the full capacity table across generations, explains why capacity gates more often than compute does, and gets specific about when the B300 premium is the rational buy and when the previous-generation H200 is the smarter one.

Key takeaways

Per-node HBM on an 8-GPU system climbs from 640 GB on H100 to about 1.13 TB on H200, 1.44 TB on B200, and about 2.3 TB on B300. That aggregate, not peak FLOPS, decides whether a model and its KV cache fit resident in memory.
Memory capacity gates model and KV-cache size more often than compute throughput does. Long-context and large-batch inference run out of bytes before they run out of math.
HBM3e bandwidth gains (4.8 TB/s per GPU on H200, up to 8 TB/s per GPU on B300) are what let the added capacity actually be fed, so larger batches and longer contexts stay throughput-positive.
Samsung, Micron, and SK Hynix are the three HBM3e sources behind GPU availability. Stack qualification, especially 12-high, is a real gate on supply.
For memory-bound inference where the full model and a deep KV cache must stay resident, the B300 premium over the B200 pays for itself. When Blackwell lead times do not fit, the H200 is the memory-rich previous-gen option.
RIL-GX-B300-2T is the highest-capacity 8-GPU node Rillor lists. When capacity is the deciding spec, it is the default.

The capacity table that actually matters

Here is the generational progression for the standard 8-GPU SXM/OAM-class node, which is the unit most buyers actually procure (the Supermicro SYS-A22GA-NBRT, the Gigabyte G894-AD1-AAX5, the Dell PowerEdge XE9680L, and their peers all build around this baseboard).

Generation	Memory type	Per-GPU capacity	Per-node (8 GPU)	Per-GPU bandwidth
H100 SXM5	HBM3	80 GB	640 GB	3.35 TB/s
H200	HBM3e	141 GB	~1.13 TB	4.8 TB/s
B200	HBM3e	180 GB	1.44 TB	up to ~8 TB/s
B300 (Blackwell Ultra)	HBM3e	288 GB	~2.3 TB	up to 8 TB/s

The shape of that table is the whole argument. From H100 to B300, per-GPU capacity grew 3.6x and per-node capacity grew from 640 GB to roughly 2.3 TB. The H100 baseline of 640 GB across eight 80 GB SXM5 modules is the anchor everyone still benchmarks against. The H200 kept the Hopper compute architecture and simply swapped in HBM3e, lifting per-GPU memory to 141 GB and bandwidth to 4.8 TB/s, which is the cleanest possible demonstration that capacity and bandwidth are an axis you can move independently of raw compute.

Blackwell pushed both. The B200 lands at 1.44 TB per node, and the B300, also called Blackwell Ultra, reaches 288 GB per GPU through 12-high HBM3e stacks. That is 50 percent more DRAM layers per stack than the B200's 8-high stacks and, on the table's figures, 60 percent more capacity than the B200's 180 GB per GPU, for roughly 2.3 TB across the node. That 12-high jump is the single most consequential memory change in the current line, and it is the reason a B300 node can hold a model and a deep KV cache that a B200 node has to start offloading or sharding.
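The per-node column and the growth ratios follow directly from the per-GPU figures. A minimal sketch of that arithmetic, using only the table's numbers (the function names are illustrative, and capacities are decimal GB):

```python
# Per-GPU HBM capacity in GB, taken from the table above.
PER_GPU_GB = {"H100": 80, "H200": 141, "B200": 180, "B300": 288}
GPUS_PER_NODE = 8


def per_node_gb(generation: str) -> int:
    """Aggregate HBM across one 8-GPU node, in decimal GB."""
    return PER_GPU_GB[generation] * GPUS_PER_NODE


def capacity_ratio(newer: str, older: str) -> float:
    """How many times larger the newer part's per-GPU capacity is."""
    return PER_GPU_GB[newer] / PER_GPU_GB[older]


for gen in PER_GPU_GB:
    print(f"{gen}: {per_node_gb(gen) / 1000:.2f} TB per node")

print(f"B300 vs H100: {capacity_ratio('B300', 'H100'):.1f}x")
print(f"B300 vs B200: {capacity_ratio('B300', 'B200'):.1f}x")
```

The rounded table entries (about 1.13 TB and about 2.3 TB) are 1,128 GB and 2,304 GB before rounding.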

B300 8-GPU HBM3e: 2.3 TB
B300 per-GPU (12-high): 288 GB
B300 per-GPU bandwidth: 8 TB/s

Why capacity gates before compute does

The intuition that compute is the bottleneck comes from training thought experiments where the model is small relative to the hardware. In the regimes that dominate real spend, large models and long contexts, the byte budget runs out first.

Start with the parameters. A model's weights have to live somewhere, and if they do not all fit in HBM you pay a tax on every step to move them in and out. Then add the optimizer state in training, which for common optimizers is several times the parameter footprint. Then add activations, which scale with batch size and sequence length. The result is that a node that has plenty of compute headroom can still be unable to hold the working set, and the only honest move is to add nodes or accept a slower memory tier. Adding nodes means more interconnect, more failure surface, and more cost per useful FLOP delivered.
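To make the training budget concrete, here is a rough sketch. It assumes mixed-precision Adam at about 16 bytes per parameter, which is a common accounting but is not a figure from this article, and it uses a hypothetical 70-billion-parameter model. Activations are deliberately left out, so the result is the headroom that remains for them on a single node.

```python
# Assumption (not from the article): mixed-precision Adam keeps about
# 16 bytes per parameter resident during training:
#   2 (bf16 weights) + 2 (bf16 gradients)
#   + 4 (fp32 master weights) + 4 (Adam momentum) + 4 (Adam variance)
BYTES_PER_PARAM_ADAM_MIXED = 16

# Per-node HBM in decimal GB, from the capacity table (8 GPUs per node).
NODE_HBM_GB = {"H100": 640, "H200": 1128, "B200": 1440, "B300": 2304}


def training_state_gb(params_billion: float,
                      bytes_per_param: float = BYTES_PER_PARAM_ADAM_MIXED) -> float:
    """Weights, gradients, and optimizer state in decimal GB (activations excluded)."""
    # Billions of parameters times bytes per parameter gives decimal GB.
    return params_billion * bytes_per_param


def activation_headroom_gb(params_billion: float, node: str) -> float:
    """HBM left on one node for activations; negative means the state alone does not fit."""
    return NODE_HBM_GB[node] - training_state_gb(params_billion)


# Hypothetical 70B-parameter model, chosen only for illustration.
for node in NODE_HBM_GB:
    print(f"{node}: {activation_headroom_gb(70, node):+.0f} GB left for activations")
```

Under these assumptions the training state alone is 1,120 GB, which overflows an H100 node, leaves almost nothing on an H200 node, and leaves real working room only on the Blackwell nodes. Sharded optimizers and lower-precision states change the constant but not the shape of the calculation.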

Inference makes the point even more sharply because of the KV cache. Every token in the context window contributes key and value tensors that must stay resident for the duration of the request, and that cache grows linearly with context length and with the number of concurrent sequences. Long-context serving is, in practice, a memory-capacity problem wearing a throughput costume. When you double the context window, you roughly double the per-request KV footprint, and the number of concurrent requests a node can serve falls accordingly unless you have the capacity to absorb it. This is why the B300's larger pool translates so directly into serving economics. More resident KV means more concurrent long-context sessions per node, which means a lower cost per served token at the configurations that long-context products actually run.
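The linear growth described above can be written down directly. The sketch below uses the standard per-token KV accounting (one key and one value tensor per layer) with a hypothetical model shape; the layer count, KV-head count, head dimension, and 140 GB weight footprint are illustrative assumptions, not figures from any specific model, and runtime overhead is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer shape; all values are illustrative assumptions."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16/bf16 cache entries


def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_value


def kv_bytes_per_request(m: ModelShape, context_tokens: int) -> int:
    # The cache grows linearly with context length.
    return kv_bytes_per_token(m) * context_tokens


def max_concurrent_requests(node_hbm_bytes: int, weight_bytes: int,
                            m: ModelShape, context_tokens: int) -> int:
    """Full-context requests that fit beside the weights on one node."""
    free = node_hbm_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // kv_bytes_per_request(m, context_tokens)


shape = ModelShape(layers=80, kv_heads=8, head_dim=128)  # assumed shape
weights = 140 * 10**9                                    # assumed: 70B params at 2 bytes
context = 128 * 1024                                     # 128K-token window

for name, gb in {"H200": 1128, "B200": 1440, "B300": 2304}.items():
    n = max_concurrent_requests(gb * 10**9, weights, shape, context)
    print(f"{name}: {n} concurrent 128K-token requests")
```

Doubling `context` doubles `kv_bytes_per_request`, and the concurrent-request count falls to roughly half, which is the serving-economics effect the paragraph describes.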

The clean way to state it: compute throughput determines latency per step; capacity determines whether the workload is feasible without splitting it. Feasibility is the harder constraint, and it is the one that capacity, not FLOPS, controls.

Bandwidth is what lets the capacity be used

Capacity without bandwidth would be a trap. A bigger pool that you cannot feed fast enough just relocates the bottleneck. The HBM3e generation moved both numbers together, and that is what makes the added capacity usable rather than ornamental.

The H200 lifted per-GPU bandwidth to 4.8 TB/s from the H100's 3.35 TB/s, a 43 percent gain, on the same compute architecture. The DGX B200 delivers 64 TB/s of aggregate HBM3e bandwidth across its eight Blackwell GPUs, and the B300 pushes per-GPU bandwidth up to 8 TB/s. For memory-bound work, bandwidth is the throughput number that matters more than peak FLOPS, because the GPU spends its time waiting on memory, not on the math units.

This is the mechanism behind the larger-batch and longer-context story on Blackwell. Decode-heavy inference is dominated by reading the KV cache and the weights for each generated token, which is a bandwidth-bound operation. Higher bandwidth lets a node sustain larger batch sizes before it saturates, and a larger resident pool lets those batches carry longer contexts. The two improvements compound. You get more capacity to hold the working set and more bandwidth to stream it, which is why a B300 node does not just hold a bigger model, it serves it at batch sizes that keep the expensive silicon busy. The fabric that ties these nodes together past a single chassis (ConnectX-7 and ConnectX-8 NICs feeding the scale-out network, with NVLink5 inside the rack) carries the story beyond the node, but inside the chassis the binding numbers are HBM3e capacity and HBM3e bandwidth.
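A back-of-the-envelope bound shows why bandwidth and batch size interact this way. The sketch assumes a purely bandwidth-bound decode step in which the weights are read once per step and each sequence's KV cache is read once, with weights and cache spread evenly across the eight GPUs so the node's aggregate bandwidth applies. Compute time, interconnect traffic, and kernel efficiency are ignored, so this is an upper bound on throughput rather than a prediction.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes_per_seq: float,
                        batch: int, node_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: bytes that must be read / bandwidth."""
    # Weights are read once per step and shared by every sequence in the batch;
    # each sequence adds its own KV-cache read.
    bytes_read = weight_bytes + batch * kv_bytes_per_seq
    return bytes_read / node_bandwidth_bytes_per_s


def tokens_per_second_bound(weight_bytes: float, kv_bytes_per_seq: float,
                            batch: int, node_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generated tokens per second (one token per sequence per step)."""
    step = decode_step_seconds(weight_bytes, kv_bytes_per_seq, batch,
                               node_bandwidth_bytes_per_s)
    return batch / step


# Aggregate node bandwidth from the per-GPU figures in the article.
H200_NODE_BW = 8 * 4.8e12
B300_NODE_BW = 8 * 8.0e12

# Hypothetical workload: 140 GB of weights, 1 GB of KV cache per sequence.
for batch in (1, 8, 64):
    h200 = tokens_per_second_bound(140e9, 1e9, batch, H200_NODE_BW)
    b300 = tokens_per_second_bound(140e9, 1e9, batch, B300_NODE_BW)
    print(f"batch {batch}: H200 <= {h200:,.0f} tok/s, B300 <= {b300:,.0f} tok/s")
```

Larger batches amortize the weight read across more tokens, so the bound rises with batch size until the KV reads dominate, and at any fixed batch it scales with bandwidth.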

The supplier landscape behind availability

HBM3e is not made by NVIDIA. It is supplied by three vendors, Samsung, Micron, and SK Hynix, and which of them can deliver qualified stacks at volume is a real constraint on how many GPUs reach the market. This is the part of the capacity story that lives upstream of the GPU and is easy to overlook when you read a spec sheet.

The 12-high HBM3e stack that gives the B300 its 288 GB per GPU is harder to build and to qualify than the 8-high stack on the B200. SK Hynix moved first on 12-high HBM3e for the GB300, with Samsung and Micron clearing NVIDIA qualification afterward. The practical implication for a buyer is that the highest-capacity parts ride on the most constrained memory supply, and the months between when a stack is announced and when it ships qualified at volume are exactly the months your delivery date depends on. Memory qualification, not wafer starts, is frequently the gate. When you read that a particular GPU is hard to get, the binding constraint is often the HBM behind it rather than the logic die.

This is also why a forward contract on the complete system, rather than a spot scramble, maps so well to the memory-supply reality. The lead time on a B300 node is the lead time on its 12-high HBM3e, and a contracted delivery month with a deposit in escrow turns that upstream uncertainty into a date both sides are bound to.

When the B300 premium is the right buy

The B300 commands a premium over the B200 at the node level, and for a meaningful share of workloads that premium is straightforwardly worth paying. The deciding question is whether your workload is memory-bound, and specifically whether the full model and a deep KV cache need to stay resident without offloading.

A B300 node holds roughly 2.3 TB of HBM3e against the B200's 1.44 TB. That extra 60 percent of capacity is the difference between a model that fits resident on a single node and one that has to be sharded across two, with all the interconnect overhead and failure surface that implies. For long-context inference, the larger pool directly raises how many concurrent sessions a node can serve, which lowers cost per token at exactly the configurations long-context products run in. For training large models, the headroom can be the difference between a workable per-device batch size and one that forces you into more aggressive parallelism. In each of those cases the premium buys you fewer nodes for the same resident footprint, and fewer nodes is usually cheaper all-in than more.

The B300 is not always the answer. If your workload is compute-bound and fits comfortably in 1.44 TB, the B200 is the better value. The rule is simple. If capacity is the spec that decides feasibility, buy the B300. If it is not, do not pay for memory you will not fill.
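The fits-on-one-node question reduces to a ceiling division. A minimal sketch using the per-node capacities from the table; it ignores runtime overhead and fragmentation, and the 2,000 GB footprint in the usage loop is a hypothetical example:

```python
import math

# Per-node HBM in decimal GB, from the capacity table (8 GPUs per node).
NODE_HBM_GB = {"H200": 1128, "B200": 1440, "B300": 2304}


def nodes_needed(resident_gb: float, node: str) -> int:
    """Minimum nodes whose combined HBM holds the resident footprint."""
    return math.ceil(resident_gb / NODE_HBM_GB[node])


# Hypothetical resident footprint (weights plus KV cache) of 2,000 GB.
for node in NODE_HBM_GB:
    print(f"{node}: {nodes_needed(2000, node)} node(s)")
```

At 2,000 GB the B200 needs two nodes where the B300 needs one, which is the case where the premium buys back a whole chassis. At 1,000 GB every option fits on one node and the extra capacity goes unused.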

When the H200 is the smarter call

There is a third option that capacity-focused buyers underrate. The H200 carries 141 GB of HBM3e per GPU and roughly 1.13 TB per node, which makes it the memory-rich previous-generation choice. It sits above the H100's 640 GB by a wide margin and is close enough to the B200's 1.44 TB to matter when Blackwell lead times do not fit your timeline.

The H200 wins on two axes. First, availability. Blackwell parts depend on newer HBM3e supply, and B300 parts in particular ride the most constrained 12-high stacks, while Hopper-class HBM3e is further along its qualification curve. If your facility-readiness milestone is sooner than a clean B300 delivery window, a node with 1.13 TB of HBM3e that you can take delivery on now can beat a higher-capacity node you cannot get until a later quarter. Second, fit. If your model and KV cache live comfortably inside 1.13 TB per node, the H200 gives you the memory you need without the Blackwell premium, and the bandwidth at 4.8 TB/s per GPU is more than respectable for memory-bound serving. The honest framing is that capacity per node and time-to-delivery are both real specs, and the H200 often wins the second one without giving up much of the first.
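Treating capacity and delivery as two hard constraints can be sketched as a simple filter. Everything about lead times here is a placeholder: the article gives no lead-time figures, so the weeks below are hypothetical inputs a buyer would replace with quoted delivery windows. Picking the smallest node that satisfies both constraints is one reasonable reading of "do not pay for memory you will not fill", not the only possible policy.

```python
from typing import Optional

# Per-node HBM in decimal GB, from the capacity table (8 GPUs per node).
NODE_HBM_GB = {"H200": 1128, "B200": 1440, "B300": 2304}


def pick_node(resident_gb: float, deadline_weeks: int,
              lead_time_weeks: dict[str, int]) -> Optional[str]:
    """Smallest single node that fits the footprint and arrives by the deadline."""
    feasible = [
        name for name, capacity in NODE_HBM_GB.items()
        if capacity >= resident_gb and lead_time_weeks[name] <= deadline_weeks
    ]
    if not feasible:
        return None  # no single node satisfies both constraints
    return min(feasible, key=lambda name: NODE_HBM_GB[name])


# HYPOTHETICAL lead times in weeks, for illustration only.
example_lead_times = {"H200": 4, "B200": 12, "B300": 20}

print(pick_node(1000, 8, example_lead_times))   # fits in 1.13 TB, needed soon
print(pick_node(2000, 26, example_lead_times))  # needs the larger pool, can wait
print(pick_node(2000, 8, example_lead_times))   # nothing fits in time
```

A `None` result is the signal to revisit one of the constraints: shard across more nodes, shrink the resident footprint, or move the deadline.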

Sizing the buy on Rillor

Once you know your resident-memory target, the procurement decision is mostly about matching that target to a delivery month at a price you can defend. That is what the Rillor marketplace is for. Buyers commit to standardized forward contracts on complete OEM GPU systems with physical delivery, a 10 percent deposit at execution held by an independent escrow agent, and the balance at delivery, which converts a memory-capacity plan into a contracted delivery date rather than a spot guess.

The highest-capacity 8-GPU node Rillor lists is RIL-GX-B300-2T, at roughly 2.3 TB of HBM3e across eight B300 GPUs. When capacity is the deciding spec, the workload is memory-bound, and the full model plus a deep KV cache must stay resident, it is the default. When timeline trumps absolute capacity, the H200-class and B200-class systems in the catalog cover the rest of the curve. The point of putting all of them on one forward market is that you can size to the spec that actually binds, capacity, and then lock the delivery without overpaying for the urgency.
