VentureBeat Mar 11, 11:42 PM
Nvidia's new open-weights Nemotron 3 Super combines three architectures to beat gpt-oss and Qwen in throughput

Multi-agent systems, designed to handle long-horizon tasks like software engineering or cybersecurity triage, can generate up to 15 times the token volume of standard chats, threatening their cost-effectiveness on enterprise tasks.
But today, Nvidia moved to address this problem with the release of Nemotron 3 Super, a 120-billion-parameter hybrid model whose weights are posted on Hugging Face.
By merging three distinct architectural philosophies (state-space models, Transformers, and a novel "latent" mixture-of-experts design), Nvidia is attempting to provide the specialized depth required for agentic workflows without the bloat typical of dense reasoning models, with the weights largely open and available for commercial use.
Triple hybrid architecture
At the core of Nemotron 3 Super is a sophisticated architectural triad that balances memory efficiency with precision reasoning. The model utilizes a Hybrid Mamba-Transformer backbone, which interleaves Mamba-2 layers with strategic Transformer attention layers.
To understand the implications for enterprise production, consider the "needle in a haystack" problem. Mamba-2 layers act like a "fast-travel" highway system, handling the vast majority of sequence processing with linear-time complexity. This allows the model to maintain a massive 1-million-token context window without the memory footprint of the KV cache exploding. However, pure state-space models often struggle with associative recall.
To fix this, Nvidia strategically inserts Transformer attention layers as "global anchors," ensuring the model can precisely retrieve specific facts buried deep within a codebase or a stack of financial reports.
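A minimal sketch shows why this interleaving keeps long-context inference cheap. The layer counts, interleaving ratio, and cache dimensions below are illustrative choices, not Nvidia's published configuration: only the periodic attention layers grow a KV cache with sequence length, while Mamba-2 layers keep a fixed-size recurrent state.

```python
# Hypothetical hybrid layer schedule; the real Nemotron 3 Super layer
# counts and Mamba/attention ratio are not assumed here.

def build_layer_schedule(n_layers: int, attention_every: int) -> list[str]:
    """Mostly Mamba-2 layers, with a Transformer attention layer
    inserted every `attention_every` layers as a 'global anchor'."""
    schedule = []
    for i in range(1, n_layers + 1):
        if i % attention_every == 0:
            schedule.append("attention")   # precise associative recall
        else:
            schedule.append("mamba2")      # linear-time sequence mixing
    return schedule

def kv_cache_bytes(schedule, seq_len, n_heads=8, head_dim=128, bytes_per_val=2):
    """Only attention layers accumulate a KV cache that scales with
    sequence length; Mamba-2 layers hold a fixed-size state instead."""
    n_attn = schedule.count("attention")
    return 2 * n_attn * seq_len * n_heads * head_dim * bytes_per_val  # K and V

schedule = build_layer_schedule(n_layers=12, attention_every=4)
hybrid = kv_cache_bytes(schedule, seq_len=1_000_000)
dense = kv_cache_bytes(["attention"] * 12, seq_len=1_000_000)
print(f"hybrid KV cache: {hybrid / 1e9:.1f} GB vs all-attention: {dense / 1e9:.1f} GB")
```

At a 1-million-token context, the cache cost of the hybrid stack scales with the handful of attention layers rather than with the full depth of the network.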
Beyond the backbone, the model introduces Latent Mixture-of-Experts (LatentMoE). Traditional Mixture-of-Experts (MoE) designs route tokens to experts in their full hidden dimension, which creates a computational bottleneck as models scale. LatentMoE solves this by projecting tokens into a compressed space before routing them to specialists.
This "expert compression" allows the model to consult four times as many specialists for the exact same computational cost. This granularity is vital for agents that must switch between Python syntax, SQL logic, and conversational reasoning within a single turn.
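The routing trick can be sketched in a few lines of NumPy. All shapes, the latent-to-model ratio, and the router below are hypothetical stand-ins, since Nvidia has not published LatentMoE at this level of detail; the point is that experts operate on a compressed vector, so their per-token cost shrinks quadratically with the projection ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 1024, 512, 16, 2  # illustrative sizes

# Hypothetical parameter shapes for a latent-space MoE layer.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)   # compress
W_up   = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)  # decompress
router = rng.standard_normal((d_latent, n_experts)) / np.sqrt(d_latent)
experts = rng.standard_normal((n_experts, d_latent, d_latent)) / np.sqrt(d_latent)

def latent_moe(x: np.ndarray) -> np.ndarray:
    z = x @ W_down                         # project token into compressed space
    logits = z @ router
    top = np.argsort(logits)[-top_k:]      # route to the top-k specialists
    w = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = sum(wi * (z @ experts[e]) for wi, e in zip(w, top))
    return out @ W_up                      # project back to model dimension

y = latent_moe(rng.standard_normal(d_model))
# With d_latent = d_model / 2, each expert costs (1/2)^2 = 1/4 the FLOPs of a
# full-width expert, so ~4x as many specialists fit in the same compute budget.
print(y.shape)
```

The 4x figure in the comment follows directly from halving the expert width, matching the granularity claim above under these assumed dimensions.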
Further accelerating the model is Multi-Token Prediction (MTP). While standard models predict a single next token, MTP predicts several future tokens simultaneously. This serves as a "built-in draft model," enabling native speculative decoding that can deliver up to 3x wall-clock speedups for structured generation tasks like code or tool calls.
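The verify-in-one-pass mechanic behind speculative decoding can be illustrated with toy deterministic models. Both functions below are hypothetical stand-ins, not the actual MTP head: a cheap "draft" proposes several tokens ahead, the full model checks them, and the output is provably identical to plain greedy decoding.

```python
# Toy speculative decoding loop in the spirit of MTP (illustrative only).
VOCAB = 50

def target_next(ctx):          # the "real" model: one expensive step
    return (sum(ctx) * 31 + 7) % VOCAB

def draft_next(ctx):           # MTP-style draft head: fast but imperfect
    tok = target_next(ctx)
    return (tok + 1) % VOCAB if len(ctx) % 5 == 0 else tok

def speculative_generate(prompt, n_new, k=4):
    seq, target_calls = list(prompt), 0
    while len(seq) < len(prompt) + n_new:
        # Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies all k proposals (one batched pass in practice).
        target_calls += 1
        ctx = list(seq)
        for t in proposal:
            correct = target_next(ctx)   # always keep the target's token
            seq.append(correct)
            ctx.append(correct)
            if t != correct or len(seq) == len(prompt) + n_new:
                break                    # discard draft tokens past a mismatch
    return seq[len(prompt):], target_calls

out, calls = speculative_generate([1, 2, 3], n_new=12)
baseline, ctx = [], [1, 2, 3]
for _ in range(12):
    t = target_next(ctx)
    baseline.append(t)
    ctx.append(t)
print(f"12 tokens in {calls} verification passes instead of 12")
```

Because accepted tokens always come from the target model, speculation changes only the wall-clock cost, never the output, which is what makes a built-in draft head safe to use for code and tool calls.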
The Blackwell advantage
For enterprises, the most significant technical leap in Nemotron 3 Super is its optimization for the Nvidia Blackwell GPU platform. By pre-training natively in NVFP4 (4-bit floating point), Nvidia has achieved a breakthrough in production efficiency.
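To see what block-scaled 4-bit quantization looks like in principle, here is a simplified round-trip sketch. Real NVFP4 packs values into FP4 (E2M1 magnitudes) with a compact scale per 16-element micro-block; this sketch keeps the scale in full precision for clarity and is not Nvidia's implementation.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 magnitudes

def quantize_block(block: np.ndarray):
    """Scale a 16-element micro-block onto the FP4 grid, then snap each
    value to the nearest representable magnitude (simplified scheme)."""
    scale = max(np.abs(block).max() / 6.0, 1e-12)
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal(64)
blocks = weights.reshape(-1, 16)               # 16-element micro-blocks
recon = np.concatenate([dequantize(*quantize_block(b)) for b in blocks])
err = np.abs(weights - recon).max()
print(f"max abs error after 4-bit round-trip: {err:.3f}")
```

Each weight is stored in 4 bits plus a shared per-block scale, which is why training and serving natively in this format cuts memory and bandwidth so sharply relative to 8-bit or 16-bit formats.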
On Blackwell, the model delivers 4x faster inference than 8-bit models running on