Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
- URL: http://arxiv.org/abs/2508.15884v3
- Date: Sun, 28 Sep 2025 18:41:58 GMT
- Title: Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
- Authors: Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai
- Abstract summary: Jet-Nemotron is a new family of hybrid-architecture language models. It matches or exceeds the accuracy of leading full-attention models.
- Score: 42.46046429414803
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
Related papers
- Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts [16.810363861148513]
Nanbeige4.1-3B is an open-source small language model (SLM). It simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously.
arXiv Detail & Related papers (2026-02-13T13:10:46Z)
- NVIDIA Nemotron 3: Efficient and Open Intelligence [227.47413816066845]
The Nemotron 3 family of models delivers strong agentic, reasoning, and conversational capabilities. Nemotron 3 models are post-trained using multi-environment reinforcement learning, enabling reasoning, multi-step tool use, and granular reasoning budget control. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens.
arXiv Detail & Related papers (2025-12-24T00:24:05Z)
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z)
- Llama-Nemotron: Efficient Reasoning Models [105.18850667504097]
We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models. The family comes in three sizes: Nano (8B), Super (49B), and Ultra (253B).
arXiv Detail & Related papers (2025-05-02T01:35:35Z)
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models [192.5620883942846]
Nemotron-H is a family of 8B and 56B/47B hybrid Mamba-Transformer models. We replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers. Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized open-sourced Transformer models.
arXiv Detail & Related papers (2025-04-04T17:41:58Z)
- EMOv2: Pushing 5M Vision Model Frontier [92.21687467702972]
We set up the new frontier of the 5M magnitude lightweight model on various downstream tasks. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, we investigate the performance upper limit of lightweight models with a magnitude of 5M.
arXiv Detail & Related papers (2024-12-09T17:12:22Z)
- Puzzle: Distillation-Based NAS for Inference-Optimized LLMs [17.72841008597783]
Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models.
arXiv Detail & Related papers (2024-11-28T13:45:42Z)
- YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs [14.85882314822983]
To map deep neural network (DNN) based object detection models to edge devices, one typically needs to compress such models significantly. In this paper, we propose a novel edge GPU friendly module for multi-scale feature interaction. We also propose a novel learning backbone adoption inspired by the changing translational information flow across various tasks.
arXiv Detail & Related papers (2021-10-26T14:02:59Z)