
NVIDIA anticipated the launch of DeepSeek V4, offering day-one support on its Blackwell GPUs with no adaptation period. The company published its first performance measurements as soon as the model became available.
This responsiveness reflects intensifying competition around very large-scale inference, where deployment speed has become a decisive criterion for data center operators.
DeepSeek V4, two models with distinct architectures
DeepSeek V4 comes in two variants: a Pro model with 1.6 trillion parameters and a Flash version with 284 billion. Both are designed to reduce computing requirements compared to previous generations.
Compared with previous generations, the model consumes only 27% of the floating-point operations needed for single-token inference, and only 10% of the KV-cache space at a one-million-token context. These reductions significantly lighten the memory load at scale.
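To put the KV-cache figure in perspective, here is a hedged back-of-envelope estimate. The layer count, head count, head dimension, and cache precision below are hypothetical placeholders chosen for illustration, not DeepSeek V4's published configuration.

```python
# Illustrative KV-cache sizing. All model dimensions below are
# hypothetical placeholders, not DeepSeek V4's actual architecture.

def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem):
    # 2x for keys and values; one entry per layer, head, and position.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical baseline: 64 layers, 8 KV heads, 128-dim heads, FP16 cache,
# at a one-million-token context.
baseline = kv_cache_bytes(64, 8, 128, 1_000_000, 2)
print(f"baseline KV cache: {baseline / 2**30:.1f} GiB")

# A model needing only 10% of that space, per the reported figure:
print(f"at 10% of baseline: {baseline * 0.10 / 2**30:.1f} GiB")
```

Even with these made-up dimensions, the arithmetic shows why a 10x cache reduction matters: it is the difference between a cache that spills across GPUs and one that fits on a single device.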
FP4 quantization, specifically MXFP4, plays a central role in these gains. It compresses the numerical representation of the model weights, which reduces memory traffic and sampling latency during inference.
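As a rough illustration of how MXFP4 works (following the OCP Microscaling convention: 32-element blocks sharing one power-of-two scale, with each element stored as FP4 E2M1), here is a minimal sketch. It is a simplified model of the format, not NVIDIA's or DeepSeek's actual quantization kernel.

```python
import numpy as np

# Simplified MXFP4-style block quantization: each block of 32 values
# shares a single power-of-two scale, and every element is rounded to
# the nearest representable FP4 (E2M1) value.

E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])
FP4_GRID = np.concatenate([-E2M1_VALUES[::-1], E2M1_VALUES])  # signed grid

def quantize_block(block):
    """Quantize one block of floats; returns (dequantized values, scale)."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return block.copy(), 1.0
    # Shared power-of-two scale so the scaled maximum fits in [-6, 6],
    # the FP4 E2M1 dynamic range.
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = block / scale
    # Round each element to the nearest representable E2M1 value.
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx] * scale, scale

rng = np.random.default_rng(0)
block = rng.normal(size=32)
deq, scale = quantize_block(block)
print("shared scale:", scale)
print("max abs error:", np.max(np.abs(deq - block)))
```

The point of the shared power-of-two scale is that each element then needs only 4 bits, cutting weight storage and the memory traffic that dominates inference cost, at the price of the rounding error printed above.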
Up to 3,500 tokens per second on GB300
According to data presented by NVIDIA, the GB300 GPU, also called Blackwell Ultra, achieves a throughput close to 3,500 tokens per second on DeepSeek V4. The company specifies that these figures are preliminary and are expected to improve with future optimizations.

The Blackwell software stack combines several technologies to reach this figure: NVFP4 precision, the Dynamo orchestration framework, optimized CUDA kernels, and advanced parallelization techniques. NVIDIA also exposes these capabilities through its NIM microservices and fine-tuning workflows.
The company positions Blackwell as a platform suitable for models exceeding a trillion parameters, with long context management of up to a million tokens in real deployment conditions.
MXFP4 compatibility extends beyond hardware boundaries
One notable element concerns Huawei’s future Ascend processors. The Ascend 950PR and Ascend 950DT chips, scheduled for launch in 2026, also integrate MXFP4 instructions.
This means that DeepSeek V4 will be technically compatible with these Chinese domestic accelerators without modifying the model. Standardization around MXFP4 could thus allow V4 to be deployed across a wider hardware spectrum than NVIDIA GPUs alone.
This cross-compatibility reflects a broader trend: low-precision quantization formats are becoming de facto standards in the ecosystem of large language models.
Optimizations still awaited
The current figure of 3,500 tokens per second is a starting point, not a ceiling. NVIDIA says its teams are working to further refine the software stack for DeepSeek V4, particularly through hardware-model co-design.
The company also presents itself as an active contributor to the open source ecosystem, with several hundred projects published under open licenses. DeepSeek V4 fits this pattern of open models that NVIDIA makes a priority to optimize at each major launch.
Source: WCCFTech
