QCKL News

NVIDIA Rubin at CES 2026: why the new AI platform changes the rules for 10/100GbE, memory, and GPU inference

Rubin shows that in AI clusters the bottleneck is no longer only the GPU, but the balance of networking, memory, storage, and rack engineering.

NVIDIA’s new platform matters not only because of PFLOPS growth, but because it forces operators to design faster networking, storage, and racks without hidden bottlenecks.

At CES 2026 NVIDIA delivered a rare signal of confidence for the market: the next step in AI hardware is not merely planned but already physically exists inside the company; chips have come back from the fab, and systems are being brought up in the lab. That matters not because Rubin is a loud name, but because it gives the market a clearer view of where the bottlenecks will move in the next few years and why the infrastructure around the GPU will once again decide almost everything. Once the performance jump per GPU becomes large enough, networking, memory, and rack integration begin to limit progress more than raw compute itself.

Rubin is designed as a platform rather than a single star GPU. NVIDIA is selling an entire stack where CPU, GPU, DPU, NICs, NVLink, and Ethernet switching are assembled as one coherent machine. In that logic these are not six isolated chips, but six components of the same system, meant to be faster in inference, materially stronger in training, and more efficient in power usage at the same time. NVIDIA is talking about up to 5x GPU inference growth and up to 3.5x training growth versus Blackwell, along with higher inference performance per watt. Even if the real numbers vary by workload, the trend is unmistakable: demand for AI compute will rise faster than the market’s willingness to absorb excess power and cooling costs.

The heart of the platform is the Rubin GPU, which NVIDIA says is tuned for NVFP4 and can deliver up to 50 PFLOPS for inference and up to 35 PFLOPS for training at the same precision. The accelerator behind the accelerator is memory: NVIDIA claims up to 288GB of HBM4 and up to 22 TB/s of aggregate bandwidth, which it frames as roughly 2.8 times what Blackwell provides. At the same time, the chip itself becomes significantly more complex, with hundreds of billions of transistors fabricated on a 3nm process. This is no longer a simple install-the-card-and-go story; the system has to be balanced end to end, or investment quickly turns into heat.
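
To see why the memory system carries so much weight, a quick back-of-envelope calculation helps. The sketch below uses only the headline figures quoted above, treats them as vendor peaks, and computes a break-even arithmetic intensity that is our own illustration, not an NVIDIA number.

```python
# Back-of-envelope sketch using the vendor "up to" figures quoted above.
# The break-even intensity is our own illustration, not an NVIDIA figure.

peak_nvfp4_flops = 50e15   # 50 PFLOPS, claimed NVFP4 inference peak
hbm4_bandwidth   = 22e12   # 22 TB/s, claimed aggregate HBM4 bandwidth

# Arithmetic intensity needed to keep the compute units fed:
# FLOPs that must be performed per byte fetched from HBM.
break_even = peak_nvfp4_flops / hbm4_bandwidth
print(f"Break-even intensity: {break_even:.0f} FLOPs per byte")  # ~2273

# For comparison, autoregressive decode often lands around 1-10 FLOPs
# per byte, far below break-even -- which is why the 288GB / 22 TB/s
# memory system matters as much as the PFLOPS headline.
```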

To scale those GPUs inside the rack, NVIDIA is leaning on NVLink 6 and a new NVLink Switch. The company is talking about as much as 3.6 TB/s of NVLink bandwidth per GPU and a move to 400Gbps SerDes, while also noting that the switch chip itself requires liquid cooling. That is a useful marker for where density is heading: traffic inside the rack is becoming aggressive enough that conventional cooling no longer feels like a safe default. It is also a warning to customers that total cost of ownership will depend more and more on engineering details, from cabling discipline to thermal design.
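
The SerDes numbers invite a similar sanity check. The sketch below derives implied lane counts from the quoted figures; since the article does not say whether 3.6 TB/s is one-way or a bidirectional total, both cases are computed, and the lane counts are illustrative rather than a published layout.

```python
# Rough sanity check on the NVLink 6 figures quoted above. Whether
# 3.6 TB/s is per direction or a bidirectional total is not stated,
# so both cases are computed; lane counts are our own illustration.

nvlink_bw_bytes  = 3.6e12   # 3.6 TB/s per GPU (claimed)
serdes_rate_bits = 400e9    # 400 Gb/s per SerDes lane (claimed)

total_bits = nvlink_bw_bytes * 8  # 28.8 Tb/s

lanes_if_one_way = total_bits / serdes_rate_bits              # 72 lanes
lanes_if_two_way = (total_bits / 2) / serdes_rate_bits        # 36 per direction

print(f"{lanes_if_one_way:.0f} lanes if 3.6 TB/s is one-way")
print(f"{lanes_if_two_way:.0f} lanes per direction if it is a two-way total")
```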

Alongside the Rubin GPU comes the Vera CPU, an Arm processor with 88 cores and SMT support for up to 176 threads. NVIDIA positions it as a step forward from Grace in data handling and compression, and it moves memory into the modular SOCAMM format with up to 1.5TB of LPDDR5X and about 1.2 TB/s of bandwidth. That matters because modularity addresses a long-standing platform problem: memory used to be fixed in place and hard to upgrade. NVIDIA is also stressing rack-scale confidential computing, meaning encryption of the compute domain not only at the GPU layer but at the CPU layer across the full rack.
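
As a rough way to read those numbers, the sketch below divides the quoted capacity and bandwidth evenly across cores and threads; the even split is our own simplification, not how Vera actually partitions memory traffic.

```python
# Illustrative per-core balance for Vera, using only the figures quoted
# above; real partitioning across cores and threads will differ.

cores        = 88
threads      = 176       # with SMT
memory_bytes = 1.5e12    # up to 1.5TB LPDDR5X (SOCAMM)
bandwidth    = 1.2e12    # about 1.2 TB/s

print(f"Memory per core:      {memory_bytes / cores / 1e9:.1f} GB")    # ~17.0 GB
print(f"Bandwidth per core:   {bandwidth / cores / 1e9:.1f} GB/s")     # ~13.6 GB/s
print(f"Bandwidth per thread: {bandwidth / threads / 1e9:.1f} GB/s")   # ~6.8 GB/s
```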

Outside the rack, the second layer of the story is networking. ConnectX-9 is presented with 1.6 Tb/s of aggregate bandwidth and 200G PAM4 SerDes, while BlueField 4 brings its own networking path plus gains in compute and memory versus the previous generation. At the Ethernet switching level, Spectrum-6 and Spectrum-X introduce co-packaged optics as a major technical lever. NVIDIA promises notable gains in power efficiency and reliability versus traditional optical approaches. The scale of the high-end switch models says everything: hundreds of 800G ports and hundreds of terabits of aggregate throughput. This is no longer a classic enterprise story, but a world where every percentage point of efficiency turns into major operating cost differences over the lifecycle.
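
The same arithmetic works at the NIC and switch level. In the sketch below, the lane count follows directly from the quoted ConnectX-9 figures, while the switch port count is a hypothetical round number standing in for the unspecified "hundreds of 800G ports".

```python
# Quick arithmetic on the networking figures quoted above. The NIC lane
# count follows from the claims; the switch port count is a HYPOTHETICAL
# round number, since the source only says "hundreds of 800G ports".

nic_aggregate_bits = 1.6e12   # ConnectX-9: 1.6 Tb/s (claimed)
serdes_bits        = 200e9    # 200G PAM4 SerDes (claimed)
print(f"SerDes lanes per NIC: {nic_aggregate_bits / serdes_bits:.0f}")  # 8

port_speed_bits = 800e9       # 800G ports (claimed)
ports           = 500         # hypothetical count for illustration
print(f"Aggregate at {ports} ports: "
      f"{ports * port_speed_bits / 1e12:.0f} Tb/s")  # 400 Tb/s
```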

On the systems side NVIDIA confirms two lines: Vera Rubin NVL72 as the dense everything-inside-NVLink design, and HGX Rubin NVL8 as the path for customers that need to stay closer to the x86 world. For NVL72, the operational model is just as important as the performance. NVIDIA is talking about modular cable-free trays, reducing rack assembly time from 100 minutes to 6 minutes, and enabling service operations without downtime for health checks and network work. This is the language of companies that have already been burned by integration complexity and now want predictable deployment at scale.

A particularly interesting idea is NVIDIA’s KV-cache layer for inference, the Inference Context Memory Storage Platform. Since context windows and intermediate data keep growing, holding everything on-node becomes too expensive, while recomputing it is also wasteful. NVIDIA’s answer is a dedicated SSD-backed layer for context storage, connected through ConnectX and BlueField and supported by software in the broader platform stack. For operators this becomes another lever: not merely adding more GPUs, but increasing effective inference throughput across the cluster while lowering energy cost.
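
To make the idea concrete, here is a minimal sketch of a tiered KV cache that spills least-recently-used blocks to disk instead of recomputing them. This is a toy illustration of the general pattern, assuming nothing about NVIDIA's actual software stack or APIs.

```python
# Toy sketch of an SSD-backed context layer: keep hot KV-cache blocks in
# fast memory, spill cold ones to disk instead of recomputing them.
# Our own illustration -- not NVIDIA's software stack or API.

import os
import pickle
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hot_capacity: int, spill_dir: str):
        self.hot = OrderedDict()          # in-memory tier, LRU order
        self.hot_capacity = hot_capacity
        self.spill_dir = spill_dir
        os.makedirs(spill_dir, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.spill_dir, f"{key}.kv")

    def put(self, key: str, kv_block) -> None:
        self.hot[key] = kv_block
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            cold_key, cold_block = self.hot.popitem(last=False)
            with open(self._path(cold_key), "wb") as f:
                pickle.dump(cold_block, f)   # spill LRU block to SSD

    def get(self, key: str):
        if key in self.hot:                  # hit in the fast tier
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._path(key)
        if os.path.exists(path):             # hit in the SSD tier:
            with open(path, "rb") as f:      # reload instead of recompute
                block = pickle.load(f)
            self.put(key, block)
            return block
        return None                          # miss: caller must recompute
```

In a real deployment the spill target would be the fast network-attached flash tier described above, and the eviction policy would track sequence reuse rather than plain LRU; the point is simply that reloading context can be far cheaper than recomputing attention over it.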

For QCKL customers, the practical conclusion is straightforward: AI infrastructure is accelerating not only because of the GPU, but because of the balance between compute, memory, and networking, and that balance will become a competitive dividing line in 2026. When Rubin-class systems start appearing on the horizon, it makes sense to design the surrounding environment in advance for fast storage, high network throughput, and predictable scalability, so that growth does not get trapped by architectural details. If you are planning GPU infrastructure for inference or training and want to lay the right network and server foundation ahead of time, take a look at QCKL solutions and choose a configuration that matches your workload profile. We can help build a stack where the bottleneck stays in the compute itself, not in everything around it.