Nvidia once again dominates MLPerf inference, this time in the v6.0 round. According to results published on April 1 and reported by ITHome, the Blackwell Ultra platform in its GB300 NVL72 configuration achieves the best performance across all scenarios, with Nvidia claiming nine times more benchmark wins than its closest competitor. The headline figure for interactive LLMs: on DeepSeek-R1 in server mode, Nvidia reports 8,064 tokens per second per GPU, 2.77 times better than under MLPerf v5.1.

An updated roster of models
The MLPerf v6.0 suite significantly broadens its range of workloads. On the LLM side, it adds GPT-OSS-120B, focused on math/science reasoning and code, and updates DeepSeek-R1 with an interactive scenario that tightens the requirements on TTFT (time to first token) and per-token throughput, making it more representative of a real-time chatbot. On the multimodal side, Qwen3-VL-235B marks the arrival of a VLM for converting unstructured data into metadata.
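To make the two interactive metrics concrete, here is a minimal sketch computing TTFT and per-token throughput from token arrival timestamps. This is a hypothetical illustration, not MLPerf harness code; the function name and timing values are invented for the example.

```python
# Illustrative only: the two metrics an interactive LLM scenario
# constrains are time to first token (TTFT) and per-token throughput.

def interactive_metrics(request_time: float, token_times: list[float]):
    """Compute TTFT and average tokens/s from arrival timestamps (seconds)."""
    ttft = token_times[0] - request_time          # latency to first token
    duration = token_times[-1] - request_time     # total generation time
    tokens_per_s = len(token_times) / duration    # average decode throughput
    return ttft, tokens_per_s

# Example: request at t=0, first token after 0.25 s, then one token
# every 20 ms for 100 tokens total.
times = [0.25 + 0.02 * i for i in range(100)]
ttft, tps = interactive_metrics(0.0, times)
print(f"TTFT = {ttft:.2f} s, throughput = {tps:.1f} tokens/s")
```

A server-mode interactive scenario typically caps both numbers per request, so an accelerator must keep TTFT low even while sustaining high aggregate decode throughput.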

Another notable addition, WAN-2.2 for text-to-video, abandons server mode in favor of SingleStream, which is better suited to the latency profile of video generators. The recommendation section switches to DLRMv3 (a Transformer architecture contributed by Meta), larger and more expensive than DCNv2. For the edge, the detection benchmark moves to YOLOv11 Large from Ultralytics.
Blackwell Ultra stacks up the records
On DeepSeek-R1 in server mode, Nvidia claims a throughput of 8,064 tokens/s/GPU, the reference figure for this edition. On Llama 3.1 405B, the announced gains reach 1.52× in server mode and 1.21× in offline mode. The highlighted hardware stack, GB300 NVL72, belongs to Nvidia's line of high-density multi-GPU systems optimized for large-scale inference.
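Some back-of-the-envelope arithmetic on the reported figures puts them in perspective. The implied v5.1 baseline and the rack-level aggregate below are derived for illustration, not published numbers, and the aggregate assumes ideal linear scaling across the 72 GPUs of an NVL72 rack.

```python
# Derived arithmetic on Nvidia's reported DeepSeek-R1 figures
# (illustrative; not published benchmark results).

v60_per_gpu = 8064   # reported tokens/s/GPU, server mode, MLPerf v6.0
speedup = 2.77       # claimed gain over MLPerf v5.1

implied_v51 = v60_per_gpu / speedup   # baseline implied by the claim
rack_aggregate = v60_per_gpu * 72     # GB300 NVL72: 72 GPUs, ideal scaling

print(f"Implied v5.1 throughput: ~{implied_v51:.0f} tokens/s/GPU")
print(f"Ideal NVL72 aggregate:   ~{rack_aggregate:,} tokens/s")
```

In other words, the claim implies a jump from roughly 2,900 to 8,064 tokens/s/GPU between the two benchmark rounds.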
These figures reflect both the evolution of kernels and memory paths and the optimization of execution graphs for interactive scenarios, where TTFT and per-token throughput become decisive. The expansion of MLPerf to VLMs and text-to-video also validates Nvidia's strategy of optimizing beyond LLMs alone, across increasingly heterogeneous pipelines.
The push toward interactive and giant models will mechanically strengthen the appeal of NVL configurations and very high-bandwidth intra-node networks. For hyperscalers, the advantage in tokens/s/GPU on DeepSeek-R1 and the jump on Llama 3.1 405B steer CAPEX decisions toward Blackwell for dialogue and agent workloads, at least in the short term, while awaiting consolidated responses from competitors on these new scenarios.
Source: ITHome