UniPELT: A Unified Framework for Parameter-Efficient Language Model
Tuning
- URL: http://arxiv.org/abs/2110.07577v1
- Date: Thu, 14 Oct 2021 17:40:08 GMT
- Title: UniPELT: A Unified Framework for Parameter-Efficient Language Model
Tuning
- Authors: Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei
Han, Wen-tau Yih, Madian Khabsa
- Abstract summary: We propose a unified framework, UniPELT, which incorporates different PELT methods as submodules and learns to activate the ones that best suit the current data or task setup.
Remarkably, on the GLUE benchmark, UniPELT consistently achieves 1~3pt gains compared to the best individual PELT method that it incorporates and even outperforms fine-tuning under different setups.
- Score: 64.638804236566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional fine-tuning of pre-trained language models tunes all model
parameters and stores a full model copy for each downstream task, which has
become increasingly infeasible as the model size grows larger. Recent
parameter-efficient language model tuning (PELT) methods manage to match the
performance of fine-tuning with far fewer trainable parameters and perform
especially well when the training data is limited. However, different PELT
methods may perform rather differently on the same task, making it nontrivial
to select the most appropriate method for a specific task, especially
considering the fast-growing number of new PELT methods and downstream tasks.
In light of model diversity and the difficulty of model selection, we propose a
unified framework, UniPELT, which incorporates different PELT methods as
submodules and learns to activate the ones that best suit the current data or
task setup. Remarkably, on the GLUE benchmark, UniPELT consistently achieves
1~3pt gains compared to the best individual PELT method that it incorporates
and even outperforms fine-tuning under different setups. Moreover, UniPELT
often surpasses the upper bound when taking the best performance of all its
submodules used individually on each task, indicating that a mixture of
multiple PELT methods may be inherently more effective than single methods.
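
The idea of "incorporating PELT methods as submodules and learning to activate them" can be illustrated with a small, dependency-free sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the `GatedSubmodule` class, the low-rank bottleneck used as a stand-in for each PELT method, and the scalar sigmoid gate computed from the hidden state are all hypothetical. The sketch only shows how a learned gate close to 0 switches a submodule off while a gate close to 1 lets it contribute.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


class GatedSubmodule:
    """Hypothetical stand-in for one PELT method: a low-rank update plus a gate."""

    def __init__(self, dim, rank, rng):
        # Small trainable matrices; the (frozen) base model is not modeled here.
        self.down = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(rank)]
        self.up = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(dim)]
        # Gate parameters (assumed form: scalar sigmoid gate on the hidden state).
        self.gate_w = [rng.gauss(0, 0.1) for _ in range(dim)]
        self.gate_b = 0.0

    def gate(self, h):
        # Value in (0, 1): how strongly this submodule is activated for input h.
        return sigmoid(dot(self.gate_w, h) + self.gate_b)

    def delta(self, h):
        # Bottleneck update: project down, apply ReLU, project back up.
        z = [max(0.0, x) for x in matvec(self.down, h)]
        return matvec(self.up, z)


def unified_forward(h, submodules):
    """Add each submodule's update to h, scaled by that submodule's gate."""
    out = list(h)
    gates = []
    for module in submodules:
        g = module.gate(h)
        gates.append(g)
        out = [o + g * d for o, d in zip(out, module.delta(h))]
    return out, gates


rng = random.Random(0)
dim = 4
submodules = [GatedSubmodule(dim, rank=2, rng=rng) for _ in range(3)]
hidden = [0.5, -1.0, 0.25, 2.0]
output, gates = unified_forward(hidden, submodules)
```

In an actual training setup the gate and submodule parameters would be learned by gradient descent while the pre-trained model stays frozen; this sketch covers only the forward combination.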
Related papers
- Dynamic Subset Tuning: Expanding the Operational Range of Parameter-Efficient Training for Large Language Models [14.762222323897978]
We propose a novel parameter-efficient training (PET) method for large language models.
Unlike prior methods, this subset is not fixed in location; rather, which parameters are modified changes over the course of training.
Our method enables a seamless scaling of the subset size across an arbitrary proportion of the total model size.
arXiv Detail & Related papers (2024-11-13T13:53:10Z)
- Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z)
- Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models [19.163639128631534]
Importance-aware Sparse Tuning (IST) is a plug-and-play technique compatible with various PEFT methods that operate on a per-layer basis.
IST dynamically updates selected layers in PEFT modules, leading to reduced memory demands.
arXiv Detail & Related papers (2024-10-15T16:53:26Z)
- Merging Multi-Task Models via Weight-Ensembling Mixture of Experts [64.94129594112557]
Merging Transformer-based models trained on different tasks yields a single unified model that can execute all the tasks concurrently.
Previous methods, exemplified by task arithmetic, have been proven to be both effective and scalable.
We propose to merge most of the parameters while upscaling the Transformer layers to a weight-ensembling mixture of experts (MoE) module.
arXiv Detail & Related papers (2024-02-01T08:58:57Z)
- Prototype-based HyperAdapter for Sample-Efficient Multi-task Tuning [30.251155072822055]
Prototype-based HyperAdapter (PHA) is a novel framework built on the adapter-tuning and hypernetwork.
It introduces an instance-dense retriever and prototypical hypernetwork to generate conditional modules in a sample-efficient manner.
We show that PHA strikes a better trade-off between trainable parameters, accuracy on downstream tasks, and sample efficiency.
arXiv Detail & Related papers (2023-10-18T02:42:17Z)
- Parameter Efficient Multi-task Model Fusion with Partial Linearization [97.23530944186078]
We propose a novel method to improve multi-task fusion for parameter-efficient fine-tuning techniques.
Our approach partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters.
We demonstrate that our partial linearization technique enables a more effective fusion of multiple tasks into a single model.
arXiv Detail & Related papers (2023-10-07T08:55:54Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
- WARP: Word-level Adversarial ReProgramming [13.08689221166729]
In many applications it is preferable to tune much smaller sets of parameters, so that the majority of parameters can be shared across multiple tasks.
We present an alternative approach based on adversarial reprogramming, which extends earlier work on automatic prompt generation.
We show that this approach outperforms other methods with a similar number of trainable parameters on SST-2 and MNLI datasets.
arXiv Detail & Related papers (2021-01-01T00:41:03Z)