
MLPerf 6.0: MI355X Surpasses 1M Tokens/s and Closes In on B200/B300, Including Multi-Node

The one-million-tokens-per-second milestone falls, and multi-node scaling stays close to linear. AMD pushes the MI355X beyond a simple throughput record, backing it with scores its partners reproduce.

AMD Instinct MI355X and MLPerf 6.0: figures and scope

Built on a 3 nm process, the AMD Instinct MI355X GPUs (CDNA 4 architecture) pack 185 billion transistors, support the FP4/FP6 data formats, and carry up to 288 GB of HBM3E memory. With up to 10 PFLOPS of FP4/FP6 compute, capacity for models of up to 520 billion parameters on a single GPU, and a UBB8 node available in air-cooled or DLC (direct liquid cooling) configurations, the platform is designed for large-scale inference.
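To make the 520-billion-parameter figure concrete, here is a rough back-of-the-envelope check (our arithmetic, not AMD's published methodology), assuming FP4 weights occupy 4 bits, i.e. 0.5 bytes per parameter:

```python
# Back-of-the-envelope check that a 520B-parameter model in FP4 fits
# in 288 GB of HBM3E. Assumption: 0.5 bytes per FP4 weight; the source
# does not break this arithmetic down.

PARAMS = 520e9          # model parameters
BYTES_PER_PARAM = 0.5   # FP4 = 4 bits per weight
HBM_GB = 288            # HBM3E capacity of one MI355X

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
headroom_gb = HBM_GB - weights_gb

print(f"FP4 weights: {weights_gb:.0f} GB")   # 260 GB
print(f"Headroom:    {headroom_gb:.0f} GB")  # ~28 GB left for KV cache/activations
```

The roughly 28 GB of headroom would go to KV cache and activations, which would explain why the quoted capacity sits below the raw 576 billion parameters that 288 GB of pure FP4 weights would imply.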

In MLPerf Inference 6.0, AMD surpasses 1 million tokens/s on Llama 2 70B (Server and Offline) and GPT-OSS-120B (Offline) through multi-node MI355X deployments. Partners reproduce the scores within +/-4% (sometimes +/-1%), with submissions covering four Instinct generations: MI300X, MI325X, MI350X, and MI355X.

Performance, scale, and model coverage

Generation over generation: on Llama 2 70B Server, an MI355X node reaches 100,282 tokens/s, 3.1x the throughput previously submitted on MI325X. The gains come from CDNA 4 plus ROCm, higher compute density, the FP4/FP6 formats, and HBM3E.

Single-node competitiveness, Llama 2 70B: against NVIDIA's B200, the MI355X platform matches it in Offline, delivers 97% of its throughput in Server, and 119% in Interactive. Against B300: 93% in Server, 92% in Offline, and 104% in Interactive.

GPT-OSS-120B (its first appearance in MLPerf): an MI355X node delivers 111% of B200 in Offline and 115% in Server. Against B300: 91% in Offline and 82% in Server.

Wan 2.2, Single Stream: 93% of B200 and 87% of B300 in the official results. Post-deadline (not verified by MLCommons): 108% of B200 and parity with B300 in Single Stream, and 111%/88% respectively in Offline.

Multi-node scalability, Llama 2 70B: from 1 to 11 nodes, scaling stays close to linear. At 11 nodes (87 MI355X GPUs): 1,042,110 tokens/s (Offline), 1,016,380 tokens/s (Server), and 785,522 tokens/s (Interactive). Scaling efficiency: 93% (Offline), 93% (Server), 98% (Interactive).

Multi-node scalability, GPT-OSS-120B: at 12 nodes (94 MI355X GPUs): 1,031,070 tokens/s (Offline) and 900,054 tokens/s (Server), at 92% and 93% efficiency, respectively. It is the second model to cross 1 million tokens/s in multi-node.
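These efficiency figures are consistent with a per-GPU normalization. A minimal sketch of that calculation, assuming 8 GPUs per node (the UBB8 form factor mentioned above) and using the Llama 2 70B Server numbers reported here:

```python
# Estimate MLPerf-style scaling efficiency: measured multi-node throughput
# divided by a linear extrapolation of single-node per-GPU throughput.
# Assumption: the single-node score comes from an 8-GPU UBB8 system.

def scaling_efficiency(multi_tps: float, n_gpus: int,
                       single_node_tps: float, gpus_per_node: int = 8) -> float:
    per_gpu = single_node_tps / gpus_per_node  # tokens/s per GPU at one node
    return multi_tps / (per_gpu * n_gpus)      # measured vs. ideal linear scaling

# Llama 2 70B Server: 100,282 tokens/s on one node; 1,016,380 on 87 GPUs.
eff = scaling_efficiency(1_016_380, 87, 100_282)
print(f"Server scaling efficiency: {eff:.0%}")  # ~93%, matching the reported figure
```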

Ecosystem, heterogeneity, and ROCm

Nine partners submit on Instinct: Cisco, Dell, Giga Computing, HPE, MangoBoost, MiTAC, Oracle, Supermicro, and Red Hat. Partner MI355X results land within +/-4% of AMD's own numbers, including on workloads AMD did not itself submit, which supports real-world reproducibility.

First heterogeneous submission across three GPU generations (MI300X + MI325X + MI355X, by Dell and MangoBoost): 141,521 tokens/s on Llama 2 70B Server and 151,843 tokens/s in Offline. The MI355X systems ran in the USA (Dell) and the MI300X/MI325X systems in Korea, demonstrating cross-geography orchestration.
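How requests get spread across mixed generations is not detailed in the results. As one illustration only, here is a hypothetical throughput-proportional router; the pool capacities below are invented, and this is not MangoBoost's actual scheduler:

```python
# Hypothetical sketch of throughput-proportional request routing across
# heterogeneous GPU pools. Capacities are illustrative placeholders; the
# real per-generation split behind the combined score is not published.
import random

POOL_CAPACITY_TPS = {   # tokens/s each pool can sustain (invented figures)
    "MI355X": 60_000,
    "MI325X": 45_000,
    "MI300X": 36_000,
}

def route_request() -> str:
    """Pick a pool with probability proportional to its serving capacity,
    so steady-state load matches each pool's share of total throughput."""
    names = list(POOL_CAPACITY_TPS)
    weights = list(POOL_CAPACITY_TPS.values())
    return random.choices(names, weights=weights, k=1)[0]

# Over many requests, aggregate throughput approaches the sum of the pools,
# which is the behavior the heterogeneous submission demonstrates at scale.
counts = {name: 0 for name in POOL_CAPACITY_TPS}
for _ in range(10_000):
    counts[route_request()] += 1
print(counts)  # roughly proportional to 60k : 45k : 36k
```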

ROCm drives FP4 execution, the GPU-to-GPU communication behind multi-node scaling, dynamic load distribution in heterogeneous environments, and rapid model bring-up (Llama, Wan, GPT-OSS). The result: performance, scalability, and flexibility across the entire Instinct portfolio.

Roadmap: MI300X (2023) laid the foundations for GenAI, MI325X (2024) increased compute and HBM3E capacity, and the MI350 series, including the MI355X (2025), adds FP4/FP6 and more model capacity for inference. For 2026, AMD plans the MI400 on CDNA 5 and the Helios rack-scale solution.

The approach pairs a significant throughput jump with growing software maturity. Parity or near-parity at single node, near-linearity in clusters, and multi-OEM reproducibility strengthen the case for a serious alternative for high-volume LLM and multimodal deployments, with a particular focus on lowering token cost through scaling efficiency.

Source: TechPowerUp