Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts
for Instruction Tuning on General Tasks
- URL: http://arxiv.org/abs/2401.02731v3
- Date: Mon, 12 Feb 2024 02:20:30 GMT
- Title: Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts
for Instruction Tuning on General Tasks
- Authors: Haoyuan Wu, Haisheng Zheng, Zhuolun He, Bei Yu
- Abstract summary: We introduce Parameter-Efficient Sparsity Crafting (PESC), which transitions dense models to sparse models using a Mixture-of-Experts (MoE) architecture.
PESC integrates adapters into the MoE layers of sparse models, differentiating experts without altering individual weights within these layers.
Our sparse models, dubbed Camelidae, outperform all other open-source sparse models and exhibit general capabilities superior to GPT-3.5.
- Score: 6.048370838631722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated considerable proficiency in
general natural language processing (NLP) tasks. Instruction tuning, a
successful paradigm, enhances the ability of LLMs to follow natural language
instructions and exhibit robust generalization across a wide range of tasks.
However, these models often encounter performance limitations across multiple
tasks due to constrained model capacity. Expanding this capacity during the
instruction tuning phase poses significant challenges. To address this issue,
we introduce a novel approach, Parameter-Efficient Sparsity Crafting (PESC),
which transitions dense models to sparse models using a Mixture of Experts
(MoE) architecture. PESC integrates adapters into the MoE layers of sparse
models, differentiating experts without altering the individual weights within
these layers. This method significantly reduces computational costs and GPU
memory requirements, facilitating model capacity expansion through a minimal
increase in parameters via the inserted adapters. Our empirical evaluation
demonstrates the effectiveness of the PESC method. Using PESC during
instruction tuning, our sparse models, dubbed Camelidae, outperform all other
open-source sparse models and exhibit general capabilities superior to
GPT-3.5.
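The core idea of PESC, as the abstract describes it, is that every expert in an MoE layer shares the same (unchanged) FFN weights and is differentiated only by a small inserted adapter. The sketch below illustrates that structure in NumPy; all dimensions, names, and the zero-initialised adapter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN, EXPERTS, TOP_K, BOTTLENECK = 8, 16, 4, 2, 2

# Shared (frozen) FFN weights -- identical for every expert.
W1 = rng.normal(0, 0.02, (DIM, HIDDEN))
W2 = rng.normal(0, 0.02, (HIDDEN, DIM))

# Per-expert bottleneck adapters -- the only expert-specific parameters.
A_down = rng.normal(0, 0.02, (EXPERTS, DIM, BOTTLENECK))
A_up = np.zeros((EXPERTS, BOTTLENECK, DIM))  # zero-init: adapters start as identity

W_router = rng.normal(0, 0.02, (DIM, EXPERTS))

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def adapter(e, h):
    """Residual bottleneck adapter for expert e (hypothetical design)."""
    return h + relu(h @ A_down[e]) @ A_up[e]

def pesc_moe(x):
    """Route each token to TOP_K experts; experts share the FFN weights
    and differ only through their adapters."""
    gates = softmax(x @ W_router)                  # (tokens, EXPERTS)
    top = np.argsort(-gates, axis=-1)[:, :TOP_K]   # chosen experts per token
    base = relu(x @ W1) @ W2                       # shared FFN output
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = gates[t, top[t]]
        w = w / w.sum()                            # renormalise over selected experts
        for k, e in enumerate(top[t]):
            out[t] += w[k] * adapter(e, base[t])
    return out

x = rng.normal(size=(5, DIM))
y = pesc_moe(x)
```

Because only `A_down`, `A_up`, and the router would be trained, the trainable-parameter count grows with the small bottleneck size rather than with full expert copies, matching the abstract's claim of capacity expansion through a minimal parameter increase.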
Related papers
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserving, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z) - Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension [6.602323571343169]
Integrated during the fine-tuning phase of pre-trained generative language models (PLMs), QASE significantly enhances their performance.
The efficacy of the QASE module has been rigorously tested across various datasets.
arXiv Detail & Related papers (2024-04-27T19:42:51Z) - Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
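Expert-level sparsification of the kind this entry describes is often based on routing statistics: experts that rarely receive gate mass on a calibration set can be pruned, and routing is renormalised over the survivors. The snippet below is a generic sketch of that idea, not the paper's specific technique; the statistics and the keep count are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
TOKENS, EXPERTS, KEEP = 1000, 8, 4

# Stand-in routing statistics gathered on a calibration set:
# gate probabilities for each token over all experts.
logits = rng.normal(size=(TOKENS, EXPERTS))
gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# Rank experts by average gate mass and keep the top KEEP.
utilisation = gates.mean(axis=0)
kept = np.sort(np.argsort(-utilisation)[:KEEP])

def route_pruned(token_gates):
    """Route only among kept experts, renormalising their gate mass."""
    g = token_gates[kept]
    return kept[np.argmax(g)], g / g.sum()

expert, renorm_gates = route_pruned(gates[0])
```

Dropping half the experts this way halves the expert parameters held in memory, which is the deployment-efficiency angle the entry highlights.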
arXiv Detail & Related papers (2024-02-22T18:56:07Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose directing effort toward efficient adaptation of existing models, augmenting language models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
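The eP-ALM recipe summarised above trains almost nothing: the pretrained encoders and LM stay frozen, and only a single linear projection plus one prepended trainable token are learned. The sketch below shows that input construction in NumPy; the dimensions and array names are assumptions standing in for real model components.

```python
import numpy as np

rng = np.random.default_rng(2)
V_DIM, L_DIM, SEQ = 32, 16, 6

# Frozen components (stand-ins for a pretrained vision encoder and LM).
visual_feat = rng.normal(size=(V_DIM,))          # frozen vision-encoder output
token_embeds = rng.normal(size=(SEQ, L_DIM))     # frozen LM token embeddings

# The only trainable pieces, mirroring the eP-ALM recipe:
W_proj = rng.normal(0, 0.02, (V_DIM, L_DIM))     # one linear projection layer
soft_token = rng.normal(0, 0.02, (L_DIM,))       # one prepended trainable token

def build_lm_input(visual_feat, token_embeds):
    """Prepend the trainable token and the projected visual feature
    to the frozen text embeddings before feeding the LM."""
    vis = visual_feat @ W_proj
    return np.vstack([soft_token, vis, token_embeds])

seq = build_lm_input(visual_feat, token_embeds)
```

With only `W_proj` and `soft_token` trainable, the learned parameters here are `V_DIM * L_DIM + L_DIM`, a tiny fraction of any realistic LM, consistent with the ">99% frozen" claim.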
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained
Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments with GPT-2 on three well-known downstream natural language datasets show improved performance and efficiency in increasing model capacity.
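A matrix product operator decomposition of the kind this entry describes factorises a weight matrix into small tensor cores, reducing parameters via a truncated bond dimension. The two-core sketch below uses a single truncated SVD after the standard index regrouping; the matrix size, factor shapes, and rank are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy weight matrix whose row and column dims each factor as 4 * 4.
W = rng.normal(size=(16, 16))

# Reshape (16, 16) -> (4, 4, 4, 4), regroup indices as (i1, j1) x (i2, j2),
# then split with one truncated SVD: a 2-core MPO (tensor-train) form.
T = W.reshape(4, 4, 4, 4).transpose(0, 2, 1, 3).reshape(16, 16)
U, S, Vt = np.linalg.svd(T, full_matrices=False)
rank = 4                                  # bond dimension (truncation level)
core1 = U[:, :rank] * S[:rank]            # (16, rank)
core2 = Vt[:rank]                         # (rank, 16)

# Reconstruct and measure the error introduced by truncation.
T_hat = core1 @ core2
W_hat = T_hat.reshape(4, 4, 4, 4).transpose(0, 2, 1, 3).reshape(16, 16)
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

At `rank = 4` the two cores hold 128 parameters versus 256 in `W`; the bond dimension is the knob trading reconstruction error against parameter count, which is how an MPO structure "reduces the parameters of the original MoE architecture".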
arXiv Detail & Related papers (2022-03-02T13:44:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.