
Nvidia Explodes MLPerf V6.0, But at What Infrastructure Cost to Follow?


Nvidia is once again dominating the MLPerf Inference benchmark v6.0. According to results published on April 1st and reported by ITHome, the Blackwell Ultra platform in its GB300 NVL72 configuration leads every scenario, with nine times more benchmark "wins" than its closest competitor. A key highlight for interactive LLMs: on DeepSeek-R1 in server mode, the company reports 8064 tokens per second per GPU, 2.77 times better than under MLPerf v5.1.
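As a quick sanity check on the claimed 2.77x speedup, the implied per-GPU throughput under MLPerf v5.1 can be back-computed from the two reported figures. This is a rough derived estimate, not a published number; the speedup factor is likely rounded:

```python
# Back-of-envelope check of the claimed 2.77x DeepSeek-R1 speedup.
V6_TOKENS_PER_SEC_PER_GPU = 8064   # reported under MLPerf v6.0 (server mode)
SPEEDUP_VS_V51 = 2.77              # reported improvement over MLPerf v5.1

# Implied v5.1 baseline (approximate, since the speedup is rounded).
implied_v51 = V6_TOKENS_PER_SEC_PER_GPU / SPEEDUP_VS_V51
print(f"Implied v5.1 throughput: ~{implied_v51:.0f} tokens/s per GPU")
```

In other words, the v5.1 result works out to roughly 2900 tokens per second per GPU.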

The MLPerf v6.0 suite revamps a range of models. For LLMs, it adds GPT-OSS-120B, focused on math/science reasoning and coding, and extends DeepSeek-R1 with an interactive scenario that tightens TTFT and token-throughput requirements to model a real-time chatbot. Qwen3-VL-235B introduces a VLM for converting unstructured data into metadata in multimodal scenarios.

Another notable addition is WAN-2.2 for text-to-video, which moves from server mode to SingleStream, a better fit for the latency profile of video generation. The recommendation benchmark switches to DLRMv3 (a Transformer-based architecture from Meta), larger and more expensive than DCNv2. For edge computing, the detection benchmark moves to Ultralytics' YOLOv11 Large.

On DeepSeek-R1 in server mode, Nvidia reaches a throughput of 8064 tokens per second per GPU, a reference figure for this edition. For Llama 3.1 405B, the announced gains are 1.52x in server mode and 1.21x in offline mode. The highlighted hardware stack, GB300 NVL72, sits among the high-density multi-GPU systems optimized for large-scale inference.
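To gauge what the per-GPU figure means at rack scale, it can be multiplied out across the 72 GPUs of a GB300 NVL72. This sketch assumes perfectly linear scaling, which real serving stacks only approximate, so the result is an idealized upper bound rather than a measured number:

```python
# Hypothetical rack-level throughput from the reported per-GPU figure.
TOKENS_PER_SEC_PER_GPU = 8064  # DeepSeek-R1, server mode, MLPerf v6.0
GPUS_PER_NVL72_RACK = 72       # a GB300 NVL72 rack integrates 72 GPUs

# Idealized aggregate under the linear-scaling assumption.
rack_throughput = TOKENS_PER_SEC_PER_GPU * GPUS_PER_NVL72_RACK
print(f"Idealized NVL72 throughput: {rack_throughput:,} tokens/s")
```

Under that assumption, a single rack would serve on the order of 580,000 tokens per second on this workload.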

These numbers reflect the evolution of kernels and memory paths as well as graph optimization for interactive scenarios, where TTFT and token throughput become decisive. Expanding MLPerf to include VLM and text-to-video validates Nvidia’s strategy of optimizing beyond just LLMs, on increasingly heterogeneous pipelines.

The push towards interactive and giant models will naturally enhance the appeal of NVL configurations and high-throughput intra-node networks. For hyperscalers, the advantage in tokens per second per GPU on DeepSeek-R1 and the leap on Llama 3.1 405B guide CAPEX decisions towards Blackwell for dialogue/agent workloads, at least in the short term, pending consolidated responses from competitors on these new scenarios.

Source: ITHome