From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
- URL: http://arxiv.org/abs/2502.12325v1
- Date: Mon, 17 Feb 2025 21:12:57 GMT
- Title: From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
- Authors: Kumari Nishu, Sachin Mehta, Samira Abnar, Mehrdad Farajtabar, Maxwell Horton, Mahyar Najibi, Moin Nabi, Minsik Cho, Devang Naik
- Abstract summary: Training large language models (LLMs) for different inference constraints is computationally expensive.
DynaMoE adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost.
Our method achieves aggregated accuracy comparable to Flextron across downstream tasks, despite using only $\frac{1}{9}\text{th}$ of its fine-tuning cost.
- Score: 37.50902921493273
- License:
- Abstract: Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, utilizing only $10B$ tokens, a minimal cost compared to the base model's training. Each variant offers distinct trade-offs between accuracy and performance. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only $\frac{1}{9}\text{th}$ of their fine-tuning cost.
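A minimal PyTorch sketch of the routing idea described in the abstract is given below. It assumes a single pair of experts (one large and one small FFN), a scalar difficulty score per token, and a fixed threshold acting as the sensitivity control; the module names, sizes, and thresholding rule are illustrative assumptions, not the DynaMoE implementation.

```python
import torch
import torch.nn as nn

class TokenDifficultyMoEBlock(nn.Module):
    """Illustrative sketch: route each token to a large or small FFN expert
    based on a predicted difficulty score (not the authors' implementation)."""

    def __init__(self, d_model=768, d_large=3072, d_small=768, threshold=0.5):
        super().__init__()
        # Difficulty router: one scalar score per token in [0, 1].
        self.router = nn.Linear(d_model, 1)
        # Larger expert for hard tokens, smaller expert for easy tokens.
        self.large_expert = nn.Sequential(
            nn.Linear(d_model, d_large), nn.GELU(), nn.Linear(d_large, d_model))
        self.small_expert = nn.Sequential(
            nn.Linear(d_model, d_small), nn.GELU(), nn.Linear(d_small, d_model))
        # Sensitivity knob: lower threshold -> more tokens go to the large expert.
        self.threshold = threshold

    def forward(self, x):                                        # x: (batch, seq, d_model)
        difficulty = torch.sigmoid(self.router(x)).squeeze(-1)   # (batch, seq)
        hard = difficulty >= self.threshold                       # boolean routing mask
        out = torch.empty_like(x)
        out[hard] = self.large_expert(x[hard])
        out[~hard] = self.small_expert(x[~hard])
        return out, difficulty

if __name__ == "__main__":
    block = TokenDifficultyMoEBlock()
    h = torch.randn(2, 16, 768)
    y, diff = block(h)
    print(y.shape, diff.mean().item())
```

Lowering `threshold` sends more tokens through the large expert (higher accuracy, higher cost) and raising it does the opposite, which is one way to read the paper's "sensitivity control" over the efficiency-accuracy balance.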
Related papers
- Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget.
We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM.
We show that trained routers operate differently from oracles and often yield suboptimal solutions.
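A small, self-contained sketch of the router-versus-oracle comparison mentioned above (illustrative only, not the Duo-LLM code): the oracle picks, per token, whichever FFN branch gives the lower next-token loss, while a learned router must decide from the hidden state alone. All sizes and module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, vocab, B, T = 64, 100, 2, 8
big_ffn   = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
small_ffn = nn.Sequential(nn.Linear(d, d // 4), nn.GELU(), nn.Linear(d // 4, d))
lm_head   = nn.Linear(d, vocab)
router    = nn.Linear(d, 2)   # learned per-token choice: 0 = small, 1 = big

h       = torch.randn(B, T, d)          # hidden states entering the FFN layer
targets = torch.randint(0, vocab, (B, T))

# Per-token loss if every token went through each branch.
loss_big = F.cross_entropy(
    lm_head(big_ffn(h)).flatten(0, 1), targets.flatten(), reduction="none").view(B, T)
loss_small = F.cross_entropy(
    lm_head(small_ffn(h)).flatten(0, 1), targets.flatten(), reduction="none").view(B, T)

# Oracle: pick the branch with the lower loss per token (1 = big wins).
oracle_choice = (loss_big < loss_small).long()
# Learned router: pick per token from the hidden state alone (no access to targets).
router_choice = router(h).argmax(dim=-1)

agreement = (oracle_choice == router_choice).float().mean()
print(f"router/oracle agreement: {agreement.item():.2%}")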
arXiv Detail & Related papers (2024-10-01T16:10:21Z)
- Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts [20.202031878825153]
We propose a novel dynamic data mixture for MoE instruction tuning.
Inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets.
Results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge & reasoning tasks and open-ended queries.
arXiv Detail & Related papers (2024-06-17T06:47:03Z)
- Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z)
- Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models [31.960749305728488]
We introduce a novel concept dubbed modular neural tangent kernel (mNTK).
We show that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$.
We propose a novel training strategy termed Modular Adaptive Training (MAT) to update only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold.
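A rough sketch of the MAT selection rule is shown below, assuming that a module's $\lambda_{\max}$ is approximated by the top eigenvalue of the Gram matrix of per-sample gradients restricted to that module, and that the "dynamic threshold" is simply the mean across modules; both choices are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn as nn

def module_lambda_max(loss_per_sample, params):
    """Approximate a module's mNTK principal eigenvalue as the top eigenvalue of
    the Gram matrix of per-sample gradients w.r.t. that module's parameters."""
    grads = []
    for loss in loss_per_sample:
        g = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        grads.append(torch.cat([x.reshape(-1) for x in g if x is not None]))
    G = torch.stack(grads)                         # (n_samples, n_params)
    gram = G @ G.t()                               # empirical NTK restricted to the module
    return torch.linalg.eigvalsh(gram)[-1].item()  # largest eigenvalue

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 1))
modules = {"in": model[0], "mid": model[2], "out": model[4]}
x, y = torch.randn(8, 10), torch.randn(8, 1)

pred = model(x)
losses = ((pred - y) ** 2).squeeze(-1)             # per-sample losses

lam = {name: module_lambda_max(losses, list(m.parameters())) for name, m in modules.items()}
threshold = sum(lam.values()) / len(lam)            # "dynamic threshold": here just the mean

# MAT-style selection (sketch): only modules whose lambda_max exceeds the
# threshold receive gradient updates this step.
for name, m in modules.items():
    for p in m.parameters():
        p.requires_grad_(lam[name] > threshold)
print(lam, "threshold:", threshold)
```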
arXiv Detail & Related papers (2024-05-13T07:46:48Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
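The summary above is terse, so the following is a generic uncertainty-aware self-training loop in the same spirit: pseudo-label unlabeled data with a teacher, estimate uncertainty with Monte-Carlo dropout, and train the student only on low-uncertainty pseudo-labels. It is a hedged illustration, not the UPET algorithm or its parameter-efficient components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mc_dropout_predict(model, x, n_samples=8):
    """Monte-Carlo dropout: keep dropout active and average several stochastic passes."""
    model.train()                       # keeps dropout enabled at "inference" time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean = probs.mean(0)
    # Predictive entropy as the uncertainty score (higher = less certain).
    uncertainty = -(mean * mean.clamp_min(1e-8).log()).sum(-1)
    return mean, uncertainty

teacher = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 3))
student = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 3))

unlabeled = torch.randn(256, 20)
mean_probs, uncertainty = mc_dropout_predict(teacher, unlabeled)
pseudo_labels = mean_probs.argmax(-1)

# Keep only the most certain pseudo-labels (the 30% cutoff is an arbitrary choice here).
keep = uncertainty < uncertainty.quantile(0.3)
selected_x, selected_y = unlabeled[keep], pseudo_labels[keep]

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
student.train()
loss = F.cross_entropy(student(selected_x), selected_y)
loss.backward()
optimizer.step()
print(f"kept {keep.sum().item()} / {len(unlabeled)} pseudo-labels, loss={loss.item():.3f}")
```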
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics [7.58472343957521]
We show that training dynamics are highly transferable across model sizes and pre-training methods.
We propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT).
arXiv Detail & Related papers (2023-10-10T12:53:48Z)
- Domain Generalization via Balancing Training Difficulty and Model Capability [61.053202176230904]
Domain generalization (DG) aims to learn domain-generalizable models from one or multiple source domains that can perform well in unseen target domains.
Despite its recent progress, most existing work suffers from the misalignment between the difficulty level of training samples and the capability of contemporarily trained models.
We design MoDify, a Momentum Difficulty framework that tackles the misalignment by balancing the seesaw between the model's capability and the samples' difficulties.
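One way to picture the difficulty/capability "seesaw" is a momentum-smoothed difficulty score per sample used to re-weight training examples toward those the model can currently handle; the EMA update and the matching rule below are illustrative assumptions, not MoDify's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(512, 16)
y = torch.randint(0, 4, (512,))

difficulty = torch.zeros(512)     # momentum-smoothed per-sample difficulty
momentum = 0.9

for step in range(50):
    idx = torch.randperm(512)[:64]
    per_sample_loss = F.cross_entropy(model(x[idx]), y[idx], reduction="none")

    # Momentum update of difficulty (here: the per-sample loss itself).
    difficulty[idx] = momentum * difficulty[idx] + (1 - momentum) * per_sample_loss.detach()

    # Capability proxy: current average loss. Emphasize samples whose difficulty
    # is close to what the model can currently handle.
    capability = per_sample_loss.detach().mean()
    weights = torch.exp(-(difficulty[idx] - capability).abs())
    loss = (weights * per_sample_loss).sum() / weights.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("mean difficulty:", difficulty.mean().item())
```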
arXiv Detail & Related papers (2023-09-02T07:09:23Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
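The recipe stated above (freeze almost all parameters, train one linear projection, prepend one trainable token) can be sketched with a toy frozen decoder; the tiny Transformer stand-in, the dimensions, and the way the perceptual feature is injected are assumptions for illustration, not the eP-ALM codebase.

```python
import torch
import torch.nn as nn

d_model, vocab, vision_dim = 128, 1000, 768

# Stand-in for a frozen pre-trained LM (embeddings + decoder stack + head).
embed = nn.Embedding(vocab, d_model)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
lm_head = nn.Linear(d_model, vocab)
for module in (embed, decoder, lm_head):
    for p in module.parameters():
        p.requires_grad_(False)        # the vast majority of parameters stay frozen

# The only trainable pieces: one linear projection of the perceptual features
# and one soft prompt token prepended to the text sequence.
project = nn.Linear(vision_dim, d_model)
soft_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)

def forward(image_feat, input_ids):
    """image_feat: (B, vision_dim), input_ids: (B, T) -> logits over the vocab."""
    B = input_ids.size(0)
    visual = project(image_feat).unsqueeze(1)               # (B, 1, d_model)
    prompt = soft_token.expand(B, -1, -1)                    # (B, 1, d_model)
    tokens = embed(input_ids)                                # (B, T, d_model)
    h = decoder(torch.cat([prompt, visual, tokens], dim=1))  # prepend prompt + image
    return lm_head(h)

logits = forward(torch.randn(2, vision_dim), torch.randint(0, vocab, (2, 12)))
print(logits.shape)   # (2, 14, vocab): soft token + visual token + 12 text tokens
```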
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning to automatically balance the representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
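The summary does not spell out the mechanism; one common reading of "self-damaging" is contrasting the network against a pruned copy of itself, so the sketch below builds a SimCLR-style loss between a full encoder and a magnitude-pruned clone. The pruning ratio and loss details are assumptions, not necessarily SDCLR's.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 64))

# "Self-damaged" branch: a clone whose smallest-magnitude weights are zeroed out.
damaged = copy.deepcopy(encoder)
with torch.no_grad():
    for p in damaged.parameters():
        if p.dim() > 1:                              # prune weight matrices only
            k = int(0.3 * p.numel())                 # assumed 30% pruning ratio
            threshold = p.abs().flatten().kthvalue(k).values
            p.mul_((p.abs() > threshold).float())

def nt_xent(z1, z2, temperature=0.2):
    """SimCLR-style loss: each sample's two views should match each other."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

x = torch.randn(16, 32)
view1, view2 = x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)

# Contrast the full encoder's view against the pruned clone's view; samples that
# the two branches disagree on most receive the strongest learning signal.
loss = nt_xent(encoder(view1), damaged(view2))
loss.backward()
print(loss.item())
```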
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- Once-for-All Adversarial Training: In-Situ Tradeoff between Robustness and Accuracy for Free [115.81899803240758]
Adversarial training and its many variants substantially improve deep network robustness, yet at the cost of compromising standard accuracy.
This paper asks how to quickly calibrate a trained model in-situ, to examine the achievable trade-offs between its standard and robust accuracies.
Our proposed framework, Once-for-all Adversarial Training (OAT), is built on an innovative model-conditional training framework.
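A minimal sketch of the model-conditional idea: sample a robustness weight $\lambda$ per batch, mix the clean and adversarial losses with it, and feed an embedding of $\lambda$ to the network so a single set of weights can later be dialled between accuracy and robustness at test time. The FGSM attack, the $\lambda$ grid, and the conditioning mechanism are illustrative choices, not necessarily OAT's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LambdaConditionedNet(nn.Module):
    """Classifier whose hidden features are shifted by an embedding of lambda."""
    def __init__(self, lambdas=(0.0, 0.25, 0.5, 1.0)):
        super().__init__()
        self.lambdas = torch.tensor(lambdas)
        self.backbone = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
        self.lam_embed = nn.Embedding(len(lambdas), 256)    # conditioning signal
        self.head = nn.Linear(256, 10)

    def forward(self, x, lam_idx):
        h = self.backbone(x) + self.lam_embed(lam_idx)      # condition on lambda
        return self.head(h)

def fgsm(model, x, y, lam_idx, eps=0.1):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x, lam_idx), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).detach()

model = LambdaConditionedNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
for step in range(20):
    lam_idx = torch.randint(0, len(model.lambdas), (1,)).expand(64)  # sample lambda per batch
    lam = model.lambdas[lam_idx[0]]
    x_adv = fgsm(model, x, y, lam_idx)
    loss = ((1 - lam) * F.cross_entropy(model(x, lam_idx), y)
            + lam * F.cross_entropy(model(x_adv, lam_idx), y))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At test time, the same weights serve any point on the robustness/accuracy curve
# simply by changing the lambda index fed to the model.
```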
arXiv Detail & Related papers (2020-10-22T16:06:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.