Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
- URL: http://arxiv.org/abs/2511.16664v1
- Date: Thu, 20 Nov 2025 18:59:21 GMT
- Title: Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
- Authors: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
- Abstract summary: Nemotron Elastic is a framework for building reasoning-oriented LLMs. It embeds nested submodels within a single parent model. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment.
- Score: 80.72350166388601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.
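The nesting idea in the abstract (submodels that reuse the parent's weights, so memory stays constant as the family grows) can be illustrated with a small sketch. This is not the paper's method: it assumes the simplest possible nesting, where a smaller model keeps only the leading hidden units of one MLP layer, whereas Nemotron Elastic selects components with a trained router. The names `make_parent_mlp`, `mlp_forward`, and `family_storage` are hypothetical and exist only for this illustration.

```python
import random


def make_parent_mlp(d_model, d_hidden, seed=0):
    """Create one toy MLP layer: input and output weights per hidden unit."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out


def mlp_forward(w_in, w_out, x, width):
    """Run the layer using only the first `width` hidden units (assumed nesting).

    No weights are copied: every submodel reads the parent's own rows.
    """
    y = [0.0] * len(x)
    for h in range(width):
        pre = sum(w * xi for w, xi in zip(w_in[h], x))
        act = max(pre, 0.0)  # ReLU
        for j, w in enumerate(w_out[h]):
            y[j] += act * w
    return y


def family_storage(sizes_in_billions, nested):
    """Parameters (in billions) held in memory to serve every model size."""
    return max(sizes_in_billions) if nested else sum(sizes_in_billions)


if __name__ == "__main__":
    w_in, w_out = make_parent_mlp(d_model=4, d_hidden=8)
    x = [1.0, -0.5, 0.25, 2.0]
    # Three "budgets" served from the same parent weights.
    for name, width in [("full", 8), ("medium", 6), ("small", 4)]:
        print(name, mlp_forward(w_in, w_out, x, width))
    # Sizes from the abstract: 12B parent with nested 9B and 6B models.
    print(family_storage([12, 9, 6], nested=True))   # 12
    print(family_storage([12, 9, 6], nested=False))  # 27
```

With the sizes quoted in the abstract, serving the 12B, 9B, and 6B models as separate checkpoints would hold 27B parameters, while the nested family holds only the 12B parent.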
Related papers
- FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment [20.331469310989956]
We argue that importance-ordered nested components can be extracted from pretrained models, and selectively activated on the available computational budget. Our approach enables a "train-once, deploy-everywhere" paradigm that offers a graceful trade-off between cost and performance without training from scratch for each budget.
arXiv Detail & Related papers (2026-02-02T19:01:40Z)
- Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization [60.309915093470416]
Matryoshka MoE (M-MoE) is a training framework that instills a coarse-to-fine structure directly into the expert ensemble. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
arXiv Detail & Related papers (2025-09-30T16:56:44Z)
- Minitron-SSM: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning [76.88243649182886]
Hybrid architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities.
arXiv Detail & Related papers (2025-04-15T17:26:29Z)
- Efficient Construction of Model Family through Progressive Training Using Model Expansion [35.743595710122506]
We propose an efficient method for constructing the model family through progressive training. Our method reduces computational costs by approximately 25% while maintaining comparable performance to independently trained models.
arXiv Detail & Related papers (2025-04-01T10:21:52Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model [20.054342930450055]
This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models.
PLRD allows for significant reductions in computational overhead and energy consumption.
Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs.
arXiv Detail & Related papers (2024-06-28T15:27:57Z)
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) shows outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)