Hidden State Variability of Pretrained Language Models Can Guide
Computation Reduction for Transfer Learning
- URL: http://arxiv.org/abs/2210.10041v2
- Date: Wed, 19 Oct 2022 01:22:12 GMT
- Title: Hidden State Variability of Pretrained Language Models Can Guide
Computation Reduction for Transfer Learning
- Authors: Shuo Xie, Jiahao Qiu, Ankita Pasad, Li Du, Qing Qu, Hongyuan Mei
- Abstract summary: We investigate whether one could make a task-specific selection on which subset of the layers to adapt.
We propose to select layers based on the variability of their hidden states given a task-specific corpus.
- Score: 16.60284838029852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While transferring a pretrained language model, common approaches
conventionally attach their task-specific classifiers to the top layer and
adapt all the pretrained layers. We investigate whether one could make a
task-specific selection on which subset of the layers to adapt and where to
place the classifier. The goal is to reduce the computation cost of transfer
learning methods (e.g. fine-tuning or adapter-tuning) without sacrificing
performance.
We propose to select layers based on the variability of their hidden states
given a task-specific corpus. We say a layer is already "well-specialized" in a
task if the within-class variability of its hidden states is low relative to
the between-class variability. Our variability metric is cheap to compute and
doesn't need any training or hyperparameter tuning. It is robust to data
imbalance and data scarcity. Extensive experiments on the GLUE benchmark
demonstrate that selecting layers based on our metric can yield significantly
stronger performance than using the same number of top layers and often match
the performance of fine-tuning or adapter-tuning the entire language model.
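The variability metric above can be computed with a single forward pass of the frozen encoder over the task corpus. Below is a minimal NumPy sketch, assuming the score for one layer is the trace of the within-class scatter of its (e.g. mean-pooled) hidden states divided by the trace of the between-class scatter; the function name, the pooling choice, and how the scores are turned into a layer selection are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def variability_ratio(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Within-class vs. between-class variability of one layer's hidden states.

    hidden_states: (num_examples, hidden_dim) pooled hidden states obtained by
                   running the frozen pretrained encoder over a task corpus.
    labels:        (num_examples,) class labels for the task.
    A lower ratio means the layer is already better "specialized" for the task.
    """
    global_mean = hidden_states.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        class_states = hidden_states[labels == c]
        class_mean = class_states.mean(axis=0)
        # within-class scatter: spread of examples around their class mean
        within += ((class_states - class_mean) ** 2).sum()
        # between-class scatter: spread of class means around the global mean
        between += len(class_states) * ((class_mean - global_mean) ** 2).sum()
    return within / between

# Example: score every layer of an encoder and rank them.
# `layer_states` would hold one (num_examples, hidden_dim) array per layer,
# e.g. collected with `output_hidden_states=True` in Hugging Face Transformers.
# scores = [variability_ratio(h, labels) for h in layer_states]
```

Since no gradients or training runs are involved, the per-layer scores are cheap to obtain, which matches the abstract's claim that the metric needs no training or hyperparameter tuning.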
Related papers
- Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning [5.119396962985841]
Intermediate task transfer learning can greatly improve model performance.
We conduct the largest study on NLP task transferability and task selection with 12k source-target pairs.
Applying ESMs on a prior method reduces execution time and disk space usage by factors of 10 and 278, respectively.
arXiv Detail & Related papers (2024-10-19T16:22:04Z)
- CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning [101.81127587760831]
Current fine-tuning methods build adapters that are largely agnostic of the context of the downstream task to learn, or of the important knowledge to maintain.
We propose CorDA, a Context-oriented Decomposition Adaptation method that builds learnable task-aware adapters.
Our method enables two options, the knowledge-preserved adaptation and the instruction-previewed adaptation.
arXiv Detail & Related papers (2024-06-07T19:10:35Z)
- On Surgical Fine-tuning for Language Encoders [2.3796105472622813]
We show that for different downstream language tasks, fine-tuning only a subset of layers is sufficient to obtain performance that is close to and often better than fine-tuning all the layers in the language encoder.
We propose an efficient metric based on the diagonal of the Fisher information matrix (FIM score) to select the candidate layers for selective fine-tuning; a generic sketch of such a diagonal-FIM score appears after this list.
arXiv Detail & Related papers (2023-10-25T22:42:30Z)
- Parameter-Efficient Tuning by Manipulating Hidden States of Pretrained Language Models For Classification Tasks [49.807185872741066]
We propose a simple tuning method that introduces only three trainable vectors.
We feed the integrated hidden state(s) to a task-specific linear classifier to predict categories.
This scheme is similar to the way ELMo utilises hidden states, except that ELMo feeds the hidden states to LSTM-based models.
arXiv Detail & Related papers (2022-04-10T04:14:02Z)
- Composable Sparse Fine-Tuning for Cross-Lingual Transfer [56.86192078426372]
Fine-tuning all parameters of a pre-trained model has become the mainstream approach for transfer learning.
We introduce a new fine-tuning method with both these desirable properties.
It outperforms adapters in zero-shot cross-lingual transfer by a large margin.
arXiv Detail & Related papers (2021-10-14T17:27:29Z)
- Robust Transfer Learning with Pretrained Language Models through Adapters [40.45102278979193]
Transfer learning with large pretrained language models like BERT has become a dominating approach for most NLP tasks.
We propose a simple yet effective adapter-based approach to mitigate these issues.
Our experiments demonstrate that such a training scheme leads to improved stability and adversarial robustness in transfer learning to various downstream tasks.
arXiv Detail & Related papers (2021-08-05T02:30:13Z)
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
- Partial Is Better Than All: Revisiting Fine-tuning Strategy for Few-shot Learning [76.98364915566292]
A common practice is to train a model on the base set first and then transfer to novel classes through fine-tuning.
We propose to transfer partial knowledge by freezing or fine-tuning particular layer(s) in the base model.
We conduct extensive experiments on CUB and mini-ImageNet to demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2021-02-08T03:27:05Z)
- Parameter-Efficient Transfer Learning with Diff Pruning [108.03864629388404]
Diff pruning is a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework.
We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark.
arXiv Detail & Related papers (2020-12-14T12:34:01Z)
- Investigating Transferability in Pretrained Language Models [8.83046338075119]
We consider a simple ablation technique for determining the impact of each pretrained layer on transfer task performance.
This technique reveals that in BERT, layers with high probing performance on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks.
arXiv Detail & Related papers (2020-04-30T17:23:19Z)
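As referenced in the "On Surgical Fine-tuning for Language Encoders" entry above, a per-layer selection score can also be derived from the diagonal of the Fisher information matrix. The PyTorch sketch below is a generic approximation, assuming the diagonal is estimated from squared gradients of the task loss (the empirical Fisher) and aggregated per layer by parameter-name prefix; the `fim_layer_scores` name and the grouping rule are assumptions for illustration, not that paper's exact procedure.

```python
import torch

def fim_layer_scores(model, dataloader, loss_fn):
    """Estimate the diagonal Fisher information per layer as a selection score.

    Layers are identified by the prefix of each parameter name up to the first
    dot -- an illustrative grouping rule, not the paper's exact one.
    """
    scores = {}
    model.eval()
    for inputs, targets in dataloader:
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # squared gradients approximate the diagonal Fisher
        for name, param in model.named_parameters():
            if param.grad is None:
                continue
            layer = name.split(".")[0]
            # accumulate the empirical Fisher (squared gradients) per layer
            scores[layer] = scores.get(layer, 0.0) + param.grad.pow(2).sum().item()
    # Higher scores mark layers whose parameters most influence the task loss,
    # i.e. candidate layers for selective fine-tuning.
    return scores
```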
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.