
Nvidia Launches Nemotron 3 Super, a 120 Billion Parameter Open AI Model Built for Agentic Workloads

  • Nvidia has released Nemotron 3 Super, a 120 billion parameter open MoE model that only activates 12.7 billion parameters per forward pass.
  • Nemotron 3 Super delivers up to 7.5x higher throughput than Qwen3.5-122B-A10B on agent workloads at 8,000 input tokens and 64,000 output tokens.
  • The model is fully open sourced under the Nvidia Nemotron Open Model License, with checkpoints and training data available on Hugging Face.

Nvidia launches Nemotron 3 Super with 7.5x throughput gain over Qwen3.5-122B

Nvidia’s latest model activates only 12.7 billion parameters per forward pass thanks to a Mixture-of-Experts (MoE) architecture, meaning most of its weights remain inactive during inference. This design choice directly addresses two issues developers encounter when deploying AI agents across multiple stages: the added cost of extended reasoning chains, and token usage that can balloon by a factor of 15 in multi-agent pipelines.
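The core MoE idea is simple: a router scores every expert per token but only the top-k actually run. Here is a minimal pure-Python sketch; the expert count and scores below are made up for illustration, only the 12.7B/120B ratio comes from the article.

```python
# Minimal sketch of MoE top-k routing (illustrative; expert count and
# router scores are invented, not Nemotron's real configuration).

def topk_experts(scores, k):
    """Return the indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Pretend router scores for 8 experts; only 2 are activated for this token.
router_scores = [0.1, 0.7, 0.05, 0.3, 0.9, 0.02, 0.4, 0.15]
active = topk_experts(router_scores, k=2)
print(active)  # the two best-scoring experts handle this token

# The headline ratio: 12.7B active out of 120B total parameters.
print(f"{12.7 / 120:.1%} of weights touched per forward pass")
```

Because only ~10.6% of the weights are touched per token, inference cost tracks the active-parameter count rather than the full 120B.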

Nemotron 3 Super is the second model in Nvidia’s Nemotron 3 family, after the Nemotron 3 Nano released in December 2025. Nvidia announced its release around March 10, 2026.

The model uses a hybrid Mamba-Transformer architecture with 88 layers. Mamba-2 blocks process long sequences with linear-time efficiency, while Transformer attention layers preserve recall precision. This combination gives the model native support for context windows of up to one million tokens without the memory penalties typical of pure-attention architectures.

Nvidia has also integrated a LatentMoE routing system that compresses token embeddings into a low-rank space before sending them to 512 experts per layer, activating 22 at a time. The company says this allows roughly four times more experts at the same inference cost as standard MoE approaches, and enables finer task specialization, such as separating Python logic from SQL handling at the expert level.
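The described routing can be pictured as a two-step pipeline: compress the token embedding to a low-rank latent, then score all 512 experts in that cheap latent space and keep the top 22. The sketch below assumes a dot-product router and made-up dimensions; only the 512/22 expert figures come from the article.

```python
import random

# Sketch of LatentMoE-style routing as described: low-rank compression,
# then expert scoring in the latent space. D_MODEL, D_LATENT, and the
# dot-product router are assumptions; 512 experts / top-22 are from the article.

D_MODEL, D_LATENT = 64, 8      # hypothetical sizes
NUM_EXPERTS, TOP_K = 512, 22   # figures reported by Nvidia

random.seed(0)
down_proj = [[random.gauss(0, 0.1) for _ in range(D_LATENT)] for _ in range(D_MODEL)]
expert_keys = [[random.gauss(0, 0.1) for _ in range(D_LATENT)] for _ in range(NUM_EXPERTS)]

def route(token_embedding):
    # 1. Low-rank compression: d_model -> d_latent.
    latent = [sum(token_embedding[i] * down_proj[i][j] for i in range(D_MODEL))
              for j in range(D_LATENT)]
    # 2. Score every expert against the cheap latent, keep the top k.
    scores = [sum(l * k_ for l, k_ in zip(latent, key)) for key in expert_keys]
    return sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]

token = [random.gauss(0, 1) for _ in range(D_MODEL)]
chosen = route(token)
print(len(chosen), "of", NUM_EXPERTS, "experts activated")
```

Scoring in an 8-dimensional latent rather than the full model dimension is what makes a 512-expert pool affordable: the router's cost scales with the latent size, not the embedding size.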

Image source: Nvidia blog.

Multi-token prediction layers, using two shared-weight heads, accelerate the generation of reasoning chains and enable native speculative decoding. On structured tasks, Nvidia reports generation up to three times faster.
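Speculative decoding works by letting a cheap draft head propose several tokens ahead, which the main model then verifies in one pass, keeping the longest agreeing prefix. The toy below uses lookup tables as stand-in "models" purely to show the accept/reject loop; it is not Nemotron's actual mechanism.

```python
# Toy sketch of speculative decoding with a multi-token draft head.
# The dict-based "models" are stand-ins for illustration only.

main_model = {"the": "quick", "quick": "brown", "brown": "fox"}
draft_head = {"the": "quick", "quick": "red"}  # imperfect, cheaper draft

def speculate(context, n_draft=2):
    """Draft n tokens ahead, then accept the prefix the main model agrees with."""
    drafted, cur = [], context
    for _ in range(n_draft):
        nxt = draft_head.get(cur)
        if nxt is None:
            break
        drafted.append(nxt)
        cur = nxt
    accepted, cur = [], context
    for tok in drafted:
        if main_model.get(cur) == tok:   # verification against the main model
            accepted.append(tok)
            cur = tok
        else:
            break                        # first mismatch ends acceptance
    return accepted

print(speculate("the"))  # draft proposes ["quick", "red"]; only "quick" verifies
```

Every accepted token is one the main model would have produced anyway, so the speedup comes free of quality loss; throughput gains depend on how often the draft head guesses right.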

The model was pre-trained on 25 trillion tokens in two phases: the first used 20 trillion tokens of general data, the second 5 trillion high-quality tokens optimized for benchmark performance. A final expansion phase on 51 billion tokens extended the native context to one million tokens. Post-training included supervised fine-tuning on approximately seven million samples and reinforcement learning across 21 environments with over 1.2 million rollouts.

In performance testing, Nemotron 3 Super scored 83.73 on MMLU-Pro, 90.21 on AIME25, and 60.47 on SWE-Bench using OpenHands. On PinchBench it achieved 85.6%, the highest score to date among open models in its category, and in long-context evaluation it scored 91.64 on RULER 1M.

On throughput, Nemotron 3 Super delivers 2.2x more than GPT-OSS-120B at 8k input and 64k output tokens; against Qwen3.5-122B-A10B, the figure reaches 7.5x. Nvidia also claims more than five times the throughput and up to twice the accuracy of the previous-generation Nemotron Super.

Nvidia trained the model end-to-end in its NVFP4 four-bit floating-point format, optimized for Blackwell GPUs. On B200 hardware, the company claims inference runs up to four times faster than FP8 on H100, with no reported loss of accuracy; FP8 and NVFP4 quantized checkpoints retain 99.8% or more of full-precision accuracy. The model also powers the Nvidia AI-Q research agent, which reached the top spot in the DeepResearch Bench rankings.
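Block-scaled low-bit formats like NVFP4 store groups of values at very low precision alongside a shared per-block scale. The real format uses FP4 (E2M1) elements with higher-precision block scales; the sketch below substitutes signed 4-bit integers to show the quantize/dequantize round trip, not the actual NVFP4 encoding.

```python
# Simplified illustration of block-scaled 4-bit quantization. Real NVFP4
# uses FP4 (E2M1) elements with per-block scale factors; this sketch uses
# signed 4-bit integers in [-7, 7] to convey the idea, not the real format.

def quantize_block(values):
    """Quantize one block to 4-bit ints with a shared scale."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # avoid divide-by-zero
    q = [max(-7, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_block(q, scale):
    return [x * scale for x in q]

block = [0.12, -0.5, 0.33, 0.02]
q, s = quantize_block(block)
restored = dequantize_block(q, s)
print("quantized:", q)
print("restored: ", [round(r, 3) for r in restored])
```

Storing 4 bits per weight plus a small per-block scale is what cuts memory traffic roughly in half versus FP8, which is where the reported throughput gains on Blackwell largely come from.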

Nemotron 3 Super is fully open-sourced under the Nvidia Nemotron Open Model License. Checkpoints in BF16, FP8, and NVFP4 formats, as well as pre-training data, post-training samples, and reinforcement learning environments, are available on Hugging Face. Inference is supported through Nvidia NIM, build.nvidia.com, Perplexity, OpenRouter, Together AI, Google Cloud, AWS, Azure, and CoreWeave, with on-premises options through Dell Enterprise Hub and HPE. Developers can access training recipes, fine-tuning guides, and inference recipe collections through the NeMo platform using vLLM, SGLang, and TensorRT-LLM.