Tiny-Attention Adapter: Contexts Are More Important Than the Number of
Parameters
- URL: http://arxiv.org/abs/2211.01979v1
- Date: Tue, 18 Oct 2022 15:20:44 GMT
- Title: Tiny-Attention Adapter: Contexts Are More Important Than the Number of
Parameters
- Authors: Hongyu Zhao, Hao Tan and Hongyuan Mei
- Abstract summary: Adapter-tuning is a paradigm that transfers a pretrained language model to downstream tasks by adding and tuning a small number of new parameters.
In this paper, we investigate the effectiveness of using tiny-attention -- i.e., attention with extremely small per-head dimensionality -- as adapters.
Our tiny-attention adapter learns to modify the hidden states at each position directly conditioned on the hidden states at all the other positions.
- Score: 25.958600375299735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapter-tuning is a paradigm that transfers a pretrained language model to
downstream tasks by adding and tuning a small number of new parameters.
Previously proposed adapter architectures are all feed-forward neural networks.
In this paper, we investigate the effectiveness of using tiny-attention --
i.e., attention with extremely small per-head dimensionality -- as adapters.
Our tiny-attention adapter learns to modify the hidden states at each position
directly conditioned on the hidden states at all the other positions, which is
missed by the previously proposed adapters. Moreover, we view its multiple
attention heads as a mixture of experts and propose to average their weights
during deployment, which further reduces its inference computation cost. On the
GLUE benchmark, our tiny-attention adapter outperforms the other
parameter-efficient transfer learning methods as well as full fine-tuning while
only updating 0.05% of the parameters. On the FewGLUE benchmark, its
performance is comparable to that of GPT-3 and PET.
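To make the described mechanism concrete, below is a minimal PyTorch sketch of a tiny-attention adapter. It is an illustration under stated assumptions, not the authors' code: the hidden size, number of heads, per-head dimensionality of 1, the residual placement, and the exact scheme for averaging head weights at deployment are all assumptions chosen to match the abstract (tiny per-head dimensionality, conditioning on all other positions, and head-weight averaging in the mixture-of-experts view).

```python
# Minimal sketch of a tiny-attention adapter (PyTorch). Hyperparameters and the
# head-averaging details are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyAttentionAdapter(nn.Module):
    """Attention with an extremely small per-head dimensionality, used as an
    adapter: each position's update is conditioned on all other positions."""

    def __init__(self, hidden_size=768, num_heads=12, head_dim=1):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        inner = num_heads * head_dim  # e.g. 12 heads x 1 dim = 12
        self.q_proj = nn.Linear(hidden_size, inner)
        self.k_proj = nn.Linear(hidden_size, inner)
        self.v_proj = nn.Linear(hidden_size, inner)
        self.out_proj = nn.Linear(inner, hidden_size)

    def forward(self, hidden_states, attn_mask=None):
        # hidden_states: (batch, seq_len, hidden_size)
        b, t, _ = hidden_states.shape

        def split(x):  # (b, t, heads * head_dim) -> (b, heads, t, head_dim)
            return x.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q = split(self.q_proj(hidden_states))
        k = split(self.k_proj(hidden_states))
        v = split(self.v_proj(hidden_states))
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        if attn_mask is not None:          # additive mask, e.g. -inf on padding
            scores = scores + attn_mask
        ctx = F.softmax(scores, dim=-1) @ v            # (b, heads, t, head_dim)
        ctx = ctx.transpose(1, 2).reshape(b, t, -1)    # (b, t, heads * head_dim)
        # Residual update: the adapter only nudges the frozen model's states.
        return hidden_states + self.out_proj(ctx)

    @torch.no_grad()
    def average_heads_for_deployment(self):
        """Mixture-of-experts view: collapse the heads into one by averaging
        their weights, which shrinks the attention computation at inference."""
        def avg_rows(linear):  # average the per-head row blocks of Q/K/V
            w = linear.weight.view(self.num_heads, self.head_dim, -1).mean(0)
            b_ = linear.bias.view(self.num_heads, self.head_dim).mean(0)
            new = nn.Linear(linear.in_features, self.head_dim)
            new.weight.copy_(w)
            new.bias.copy_(b_)
            return new

        self.q_proj = avg_rows(self.q_proj)
        self.k_proj = avg_rows(self.k_proj)
        self.v_proj = avg_rows(self.v_proj)
        # The output projection reads heads * head_dim columns; average its
        # per-head column blocks so shapes stay consistent (an assumption here).
        w = self.out_proj.weight.view(-1, self.num_heads, self.head_dim).mean(1)
        new_out = nn.Linear(self.head_dim, self.out_proj.out_features)
        new_out.weight.copy_(w)
        new_out.bias.copy_(self.out_proj.bias)
        self.out_proj = new_out
        self.num_heads = 1
```

In a typical parameter-efficient setup, the pretrained transformer stays frozen and only these adapter weights are trained; the actual parameter count depends on where and how often the adapter is inserted, so the 0.05% figure above should not be read off this sketch.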
Related papers
- Towards Optimal Adapter Placement for Efficient Transfer Learning [73.1149084352343]
Parameter-efficient transfer learning (PETL) aims to adapt pre-trained models to new downstream tasks while minimizing the number of fine-tuned parameters.
Adapters, a popular approach in PETL, inject additional capacity into existing networks by incorporating low-rank projections (a minimal bottleneck-adapter sketch appears after this list).
This paper investigates the relationship between the placement of an adapter and its performance.
arXiv Detail & Related papers (2024-10-21T10:37:17Z)
- Hadamard Adapter: An Extreme Parameter-Efficient Adapter Tuning Method for Pre-trained Language Models [108.08773541490191]
Pre-trained language models (PLMs) have a huge number of parameters, so fine-tuning them is often expensive and time-consuming.
A parameter-efficient approach is therefore needed to reduce the number of tuned parameters without compromising performance on downstream tasks.
In this paper, we design a novel adapter which only acts on self-attention outputs in PLMs.
arXiv Detail & Related papers (2024-07-04T18:21:28Z)
- Mini but Mighty: Finetuning ViTs with Mini Adapters [7.175668563148084]
Adapters perform poorly when their dimension is small.
We propose MiMi, a training framework that addresses this issue.
Our method outperforms existing methods in finding the best trade-off between accuracy and trained parameters.
arXiv Detail & Related papers (2023-11-07T10:41:27Z)
- MerA: Merging Pretrained Adapters For Few-Shot Learning [71.44422347502409]
We propose Merging Pretrained Adapters (MerA), which efficiently incorporates pretrained adapters into a single model through model fusion.
Experiments on two PLMs demonstrate that MerA achieves substantial improvements over both single adapters and AdapterFusion.
arXiv Detail & Related papers (2023-08-30T12:10:17Z)
- Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy [17.203320079872952]
Current state-of-the-art results in computer vision depend in part on fine-tuning large pre-trained vision models.
With the exponential growth of model sizes, conventional full fine-tuning leads to increasingly large storage and transmission overhead.
In this paper, we investigate how to make adapters even more efficient, reaching a new minimum size required to store a task-specific fine-tuned network.
arXiv Detail & Related papers (2023-07-31T17:22:17Z)
- Towards Efficient Visual Adaption via Structural Re-parameterization [76.57083043547296]
We propose a parameter-efficient and computationally friendly adapter for giant vision models, called RepAdapter.
RepAdapter outperforms full tuning by +7.2% on average and saves up to 25% training time, 20% GPU memory, and 94.6% storage cost of ViT-B/16 on VTAB-1k.
arXiv Detail & Related papers (2023-02-16T06:14:15Z)
- AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models on downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost of storing a large copy of the model weights for every task, but also causes instability during few-shot task adaptation.
We introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost by two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
- AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks [55.705355299065474]
Transformer-based pre-trained models with millions of parameters require large storage.
Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters.
In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed.
arXiv Detail & Related papers (2022-04-30T16:49:41Z)
- On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation [36.37565646597464]
Adapter-based tuning works by adding lightweight adapter modules to a pretrained language model (PrLM).
It adds only a few trainable parameters per new task, allowing a high degree of parameter sharing.
We demonstrate that adapter-based tuning outperforms fine-tuning on low-resource and cross-lingual tasks.
arXiv Detail & Related papers (2021-06-06T16:10:12Z)
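For contrast with the tiny-attention adapter sketched above, the feed-forward adapters mentioned in the abstract and the low-rank projections mentioned in the adapter-placement summary are typically bottleneck modules applied position-wise. The following is a minimal sketch assuming a standard bottleneck design with a GELU nonlinearity and a residual connection; the bottleneck width of 64 is an illustrative choice. Unlike the tiny-attention adapter, it updates each position without conditioning on any other position.

```python
# Minimal bottleneck (feed-forward) adapter sketch for contrast with the
# tiny-attention adapter above; width and activation are illustrative choices.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Low-rank down-projection -> nonlinearity -> up-projection, applied
    position-wise with a residual connection: no cross-position mixing."""

    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_size) -> same shape, each position independent
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```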