Task-Specific Skill Localization in Fine-tuned Language Models
- URL: http://arxiv.org/abs/2302.06600v2
- Date: Sun, 2 Jul 2023 01:55:00 GMT
- Title: Task-Specific Skill Localization in Fine-tuned Language Models
- Authors: Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, Sanjeev Arora
- Abstract summary: This paper introduces the term skill localization for the problem of identifying where newly-learnt task-specific skills reside inside a fine-tuned model.
A simple optimization is used to identify a very small subset of parameters.
Grafting the fine-tuned values for just this tiny subset onto the pre-trained model yields performance almost as good as that of the fine-tuned model.
- Score: 36.53572616441048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models can be fine-tuned to solve diverse NLP tasks,
including in few-shot settings. Thus fine-tuning allows the model to quickly
pick up task-specific "skills," but there has been limited study of where
these newly-learnt skills reside inside the massive model. This paper
introduces the term skill localization for this problem and proposes a
solution. Given the downstream task and a model fine-tuned on that task, a
simple optimization is used to identify a very small subset of parameters
($\sim0.01$% of model parameters) responsible for ($>95$%) of the model's
performance, in the sense that grafting the fine-tuned values for just this
tiny subset onto the pre-trained model gives performance almost as good as that of the
fine-tuned model. While reminiscent of recent works on parameter-efficient
fine-tuning, the novel aspects here are that: (i) No further re-training is
needed on the subset (unlike, say, with lottery tickets). (ii) Notable
improvements are seen over vanilla fine-tuning with respect to calibration of
predictions in-distribution ($40$-$90$% error reduction) as well as the quality
of predictions out-of-distribution (OOD). In models trained on multiple tasks,
a stronger notion of skill localization is observed, where the sparse regions
corresponding to different tasks are almost disjoint, and their overlap (when
it happens) is a proxy for task similarity. Experiments suggest that
localization via grafting can assist certain forms of continual learning.
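To make the grafting operation concrete, below is a minimal sketch, not the authors' released code: given pre-trained and fine-tuned checkpoints plus a binary mask marking the roughly 0.01% of parameters selected by the paper's optimization (the mask-finding step itself is not shown), grafting copies the fine-tuned values only at the masked positions. A small helper also illustrates the overlap measure the abstract mentions as a proxy for task similarity. All function and variable names are illustrative assumptions.

```python
import torch

def graft(pretrained_state, finetuned_state, mask):
    """Graft fine-tuned values onto the pre-trained weights at masked positions only.

    `mask[name]` is a {0,1} tensor marking the tiny subset (~0.01% of entries)
    selected by the paper's optimization; everywhere else the pre-trained value is kept.
    """
    grafted = {}
    for name, theta_pre in pretrained_state.items():
        theta_ft = finetuned_state[name]
        m = mask.get(name, torch.zeros_like(theta_pre))
        # theta_graft = theta_pre + m * (theta_ft - theta_pre)
        grafted[name] = theta_pre + m * (theta_ft - theta_pre)
    return grafted

def mask_overlap(mask_a, mask_b):
    """Jaccard overlap between two tasks' grafting masks (a rough task-similarity proxy)."""
    inter, union = 0, 0
    for name in mask_a:
        a, b = mask_a[name].bool(), mask_b[name].bool()
        inter += (a & b).sum().item()
        union += (a | b).sum().item()
    return inter / max(union, 1)

# Hypothetical usage:
# model.load_state_dict(graft(pretrained.state_dict(), finetuned.state_dict(), mask))
```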
Related papers
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models [46.92994945808424]
Catastrophic forgetting emerges as a critical challenge when fine-tuning multi-modal large language models (MLLMs).
This paper presents a comprehensive analysis of catastrophic forgetting in MLLMs and introduces a post-training adjustment method called Model Tailor.
arXiv Detail & Related papers (2024-02-19T11:02:05Z)
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks [37.278707106871295]
We study how fine-tuning alters the underlying capabilities learned by a model during pretraining.
We show that fine-tuning rarely alters the underlying model capabilities.
We also show that fine-tuning can unintentionally remove a model's safety wrapper.
arXiv Detail & Related papers (2023-11-21T18:51:04Z)
- FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning [21.693779973263172]
In this paper, we introduce a fine-tuning approach termed Feature Discrimination Alignment (FD-Align).
Our method aims to bolster the model's generalizability by preserving the consistency of spurious features.
Once fine-tuned, the model can seamlessly integrate with existing methods, leading to performance improvements.
arXiv Detail & Related papers (2023-10-23T17:12:01Z)
- Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models [23.881825575095945]
Large pre-trained language models (PLMs) have demonstrated strong performance on natural language understanding (NLU) tasks through fine-tuning.
However, fine-tuned models still suffer from overconfident predictions, especially in out-of-domain settings.
We demonstrate that the PLMs are well-calibrated on the masked language modeling task with robust predictive confidence under domain shift.
We show that preserving pre-trained features can improve the calibration of fine-tuned language models.
arXiv Detail & Related papers (2023-05-30T17:35:31Z)
- Skill-Based Few-Shot Selection for In-Context Learning [123.26522773708683]
Skill-KNN is a skill-based few-shot selection method for in-context learning.
It does not require training or fine-tuning of any models, making it suitable for frequently expanding or changing example banks.
Experimental results across five cross-domain semantic parsing datasets and six backbone models show that Skill-KNN significantly outperforms existing methods.
arXiv Detail & Related papers (2023-05-23T16:28:29Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies (see the sketch after this list for the general delta-patch idea).
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of the parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings.
We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address shortcomings of such reward models via uncertainty estimation, which can improve sample efficiency and robustness through active learning and risk-averse reinforcement learning.
arXiv Detail & Related papers (2022-03-14T20:13:21Z)
- Sufficiently Accurate Model Learning for Planning [119.80502738709937]
This paper introduces the constrained Sufficiently Accurate model learning approach.
It provides examples of such constrained model-learning problems and presents a theorem on how close approximate solutions can come to optimal ones.
The approximate solution quality will depend on the function parameterization, loss and constraint function smoothness, and the number of samples in model learning.
arXiv Detail & Related papers (2021-02-11T16:27:31Z)
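As a companion to the grafting sketch above, the $\Delta$-Patching entry in the list suggests a closely related recipe: store only the parameter difference between an adapted model and its base instead of a full model copy, and add it back at load time. The code below is a hedged illustration of that general idea, not the paper's actual method; the sparsification threshold and all names are assumptions.

```python
import torch

def make_delta_patch(base_state, adapted_state, keep_threshold=0.0):
    """Store only parameter differences; optionally zero out tiny deltas to keep the patch small."""
    patch = {}
    for name, base in base_state.items():
        delta = adapted_state[name] - base
        if keep_threshold > 0:
            # Drop negligible changes so the patch compresses well (assumed heuristic).
            delta = torch.where(delta.abs() > keep_threshold, delta, torch.zeros_like(delta))
        patch[name] = delta
    return patch

def apply_delta_patch(base_state, patch):
    """Reconstruct adapted weights by adding the stored delta back onto the base weights."""
    return {name: base + patch.get(name, 0) for name, base in base_state.items()}

# Hypothetical usage:
# patch = make_delta_patch(base_model.state_dict(), finetuned_model.state_dict(), keep_threshold=1e-4)
# model.load_state_dict(apply_delta_patch(base_model.state_dict(), patch))
```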
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.