AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large
Language Models
- URL: http://arxiv.org/abs/2205.12410v1
- Date: Tue, 24 May 2022 23:41:22 GMT
- Title: AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large
Language Models
- Authors: Yaqing Wang, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed
Hassan Awadallah, Jianfeng Gao
- Abstract summary: Fine-tuning large-scale pre-trained language models to downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost of storing a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost via two key techniques.
- Score: 119.7093605087114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large-scale pre-trained language models to downstream tasks
requires updating hundreds of millions of parameters. This not only increases
the serving cost to store a large copy of the model weights for every task, but
also exhibits instability during few-shot task adaptation. Parameter-efficient
techniques have been developed that tune small trainable components (e.g.,
adapters) injected in the large model while keeping most of the model weights
frozen. The prevalent mechanism to increase adapter capacity is to increase the
bottleneck dimension which increases the adapter parameters. In this work, we
introduce a new mechanism to improve adapter capacity without increasing
parameters or computational cost via two key techniques. (i) We introduce
multiple shared adapter components in each layer of the Transformer
architecture. We leverage sparse learning via random routing to update the
adapter parameters (the encoder is kept frozen), resulting in the same
computational cost (FLOPs) as training a single adapter. (ii) We propose a
simple merging mechanism that averages the weights of the multiple adapter
components, collapsing them into a single adapter in each Transformer layer,
thereby keeping the overall parameter count unchanged while significantly
improving performance. We demonstrate that these techniques work well across
multiple task
settings including fully supervised and few-shot Natural Language Understanding
tasks. By tuning only 0.23% of a pre-trained language model's parameters, our
model outperforms full model fine-tuning as well as several competing
methods.
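The two techniques above lend themselves to a compact illustration. The following is a minimal PyTorch sketch of how they could be realized, not the authors' implementation: the class names (Adapter, MixtureOfAdapters), the collapse() helper, and the chosen dimensions are illustrative assumptions. It shows (i) several adapter components per layer with random routing, so each training step runs exactly one component and costs the same FLOPs as a single adapter, and (ii) a merge step that averages the components' weights into one adapter for serving.

```python
# Sketch only: illustrates the abstract's two techniques under assumed
# names and sizes; it is not the AdaMix reference implementation.
import random

import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project."""

    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the frozen Transformer output passes through.
        return x + self.up(self.act(self.down(x)))


class MixtureOfAdapters(nn.Module):
    """Multiple adapter components per layer; one is routed to per step."""

    def __init__(self, d_model: int, bottleneck: int, num_components: int = 4):
        super().__init__()
        self.components = nn.ModuleList(
            [Adapter(d_model, bottleneck) for _ in range(num_components)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Random routing: only one component runs per forward pass, so the
        # training FLOPs match those of a single adapter.
        idx = random.randrange(len(self.components)) if self.training else 0
        return self.components[idx](x)

    @torch.no_grad()
    def collapse(self) -> Adapter:
        """Average component weights into a single adapter, keeping the
        served parameter count equal to a single-adapter baseline."""
        d_model = self.components[0].down.in_features
        bottleneck = self.components[0].down.out_features
        merged = Adapter(d_model, bottleneck)
        for name, param in merged.named_parameters():
            stacked = torch.stack(
                [dict(c.named_parameters())[name] for c in self.components]
            )
            param.copy_(stacked.mean(dim=0))
        return merged


# Usage: inject one mixture per Transformer layer, train only the adapters
# with the encoder frozen, then collapse to a single adapter before serving.
mix = MixtureOfAdapters(d_model=768, bottleneck=16, num_components=4)
merged = mix.collapse()
print(merged(torch.randn(2, 8, 768)).shape)  # torch.Size([2, 8, 768])
```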
Related papers
- Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision [52.80792724919329]
We introduce a novel framework named Adapter-X to improve fine-tuning in 2D image and 3D point cloud modalities.
It is the first to outperform full fine-tuning in both 2D image and 3D point cloud modalities with significantly fewer parameters, i.e., only 0.20% and 1.88% of original trainable parameters for 2D and 3D classification tasks.
arXiv Detail & Related papers (2024-06-05T08:26:44Z) - Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models [12.230087530720652]
We introduce an adapter module with better efficiency in large-scale multi-task adaptation scenarios.
The adapter consists of a single shared controller network and multiple task-level adapter heads.
arXiv Detail & Related papers (2024-03-25T17:21:56Z) - MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning [20.68925288222065]
Mixture of Sparse Adapters, or MoSA, is a novel Adapter Tuning method.
MoSA can achieve significantly better performance than standard adapters without any additional computational or storage overhead.
MoSA consistently outperforms other Adapter Tuning methods as well as other baselines by a large margin.
arXiv Detail & Related papers (2023-12-05T17:50:55Z) - MerA: Merging Pretrained Adapters For Few-Shot Learning [71.44422347502409]
We propose Merging Pretrained Adapters (MerA), which efficiently incorporates pretrained adapters into a single model through model fusion.
Experiments on two PLMs demonstrate that MerA achieves substantial improvements compared to both single adapters and AdapterFusion.
arXiv Detail & Related papers (2023-08-30T12:10:17Z) - Consolidator: Mergeable Adapter with Grouped Connections for Visual
Adaptation [53.835365470800916]
We show how to efficiently and effectively transfer knowledge in a vision transformer.
We propose the consolidator, which modifies the pre-trained model by adding a small set of tunable parameters.
Our consolidator can reach up to 7.56% better accuracy than full fine-tuning with merely 0.35% of the parameters.
arXiv Detail & Related papers (2023-04-30T23:59:02Z) - AdapterBias: Parameter-efficient Token-dependent Representation Shift
for Adapters in NLP Tasks [55.705355299065474]
Transformer-based pre-trained models with millions of parameters require large storage.
Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters.
In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed.
arXiv Detail & Related papers (2022-04-30T16:49:41Z) - AdapterHub: A Framework for Adapting Transformers [148.6877231725939]
AdapterHub is a framework that allows dynamic "stitching-in" of pre-trained adapters for different tasks and languages.
Our framework enables scalable and easy access to sharing of task-specific models.
arXiv Detail & Related papers (2020-07-15T15:56:05Z)