Rethinking the Role of Scale for In-Context Learning: An
Interpretability-based Case Study at 66 Billion Scale
- URL: http://arxiv.org/abs/2212.09095v2
- Date: Wed, 16 Aug 2023 09:09:53 GMT
- Title: Rethinking the Role of Scale for In-Context Learning: An
Interpretability-based Case Study at 66 Billion Scale
- Authors: Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan
Bodapati, Katrin Kirchhoff, Dan Roth
- Abstract summary: We investigate the hypothesis that the ability of a large language model to in-context learn, i.e., perform a task from in-context examples, is not uniformly spread across its underlying components.
We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and numbers of in-context examples.
- Score: 60.336655143884904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models have been shown to perform better with an increase in scale
on a wide variety of tasks via the in-context learning paradigm. In this paper,
we investigate the hypothesis that the ability of a large language model to
in-context learn, i.e., perform a task from in-context examples, is not uniformly spread across all of its
underlying components. Using a 66 billion parameter language model (OPT-66B)
across a diverse set of 14 downstream tasks, we find this is indeed the case:
$\sim$70% of attention heads and $\sim$20% of feed forward networks can be
removed with minimal decline in task performance. We find substantial overlap
in the set of attention heads (un)important for in-context learning across
tasks and number of in-context examples. We also address our hypothesis through
a task-agnostic lens, finding that a small set of attention heads in OPT-66B
score highly on their ability to perform primitive induction operations
associated with in-context learning, namely, prefix matching and copying. These
induction heads overlap with task-specific important heads, reinforcing
arguments by Olsson et al. (arXiv:2209.11895) regarding induction head
generality to more sophisticated behaviors associated with in-context learning.
Overall, our study provides several insights that indicate large language
models may be under-trained for in-context learning and opens up questions on
how to pre-train language models to more effectively perform in-context
learning.
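The two measurements behind these findings can be pictured with a small, self-contained PyTorch sketch. The code below is an illustration under stated assumptions, not the authors' released code: on a toy, randomly initialized single attention layer it (a) computes a per-head prefix-matching score on a random token sequence repeated twice, in the spirit of the induction-head probes referenced above, and (b) zeroes out ("removes") the lowest-scoring heads to mimic head ablation. All sizes, thresholds, and variable names are invented for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, D_MODEL, N_HEADS, HEAD_DIM = 50, 64, 8, 8
SEQ = 32  # length of the random prefix; the probe is the prefix repeated twice

# Toy single-layer attention parameters (random, untrained).
embed = torch.randn(VOCAB, D_MODEL)
w_q = torch.randn(N_HEADS, D_MODEL, HEAD_DIM) / D_MODEL ** 0.5
w_k = torch.randn(N_HEADS, D_MODEL, HEAD_DIM) / D_MODEL ** 0.5
w_v = torch.randn(N_HEADS, D_MODEL, HEAD_DIM) / D_MODEL ** 0.5

def attention_patterns(tokens: torch.Tensor) -> torch.Tensor:
    """Causal attention weights of shape (n_heads, T, T) for a 1-D tensor of token ids."""
    x = embed[tokens]                                   # (T, d_model)
    q = torch.einsum("td,hdk->htk", x, w_q)             # (H, T, head_dim)
    k = torch.einsum("td,hdk->htk", x, w_k)
    scores = torch.einsum("htk,hsk->hts", q, k) / HEAD_DIM ** 0.5
    causal = torch.tril(torch.ones(len(tokens), len(tokens), dtype=torch.bool))
    return F.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

# Probe sequence: a random prefix repeated twice, so the token at position SEQ + i
# matches the one at position i, and the "induction target" is position i + 1.
prefix = torch.randint(0, VOCAB, (SEQ,))
tokens = torch.cat([prefix, prefix])
attn = attention_patterns(tokens)                       # (H, 2*SEQ, 2*SEQ)

# Prefix-matching score per head: average attention from each second-copy token
# back to the token that followed its earlier occurrence.
queries = torch.arange(SEQ, 2 * SEQ)                    # positions in the second copy
keys = queries - SEQ + 1                                # position after the first occurrence
prefix_matching = attn[:, queries, keys].mean(dim=-1)   # (H,)
print("prefix-matching score per head:", [round(s, 3) for s in prefix_matching.tolist()])

# Head ablation: zero out ("remove") the lowest-scoring half of the heads and
# recombine the remaining per-head outputs as usual.
keep = prefix_matching >= prefix_matching.median()      # boolean mask over heads
v = torch.einsum("td,hdk->htk", embed[tokens], w_v)     # (H, T, head_dim)
head_out = torch.einsum("hts,hsk->htk", attn, v)        # per-head attention outputs
head_out = head_out * keep[:, None, None]               # ablated heads contribute nothing
combined = head_out.permute(1, 0, 2).reshape(len(tokens), N_HEADS * HEAD_DIM)
print("heads kept after ablation:", keep.nonzero(as_tuple=True)[0].tolist())
```

In the paper itself, importance is measured per task on OPT-66B and heads are removed by importance rather than by this toy induction score; the sketch only shows the shape of the two computations.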
Related papers
- Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs).
In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt.
Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z)
- In-context Learning in Presence of Spurious Correlations [8.055478206164105]
We study the possibility of training an in-context learner for classification tasks involving spurious features.
We find that the conventional approach of training in-context learners is susceptible to spurious features.
We propose a novel technique to train such a learner for a given classification task.
arXiv Detail & Related papers (2024-10-04T04:26:36Z)
- Teaching Smaller Language Models To Generalise To Unseen Compositional Questions [6.9076450524134145]
We propose multitask pretraining on up to 93 tasks designed to instill diverse reasoning abilities.
We show that performance can be significantly improved by adding retrieval-augmented training datasets.
arXiv Detail & Related papers (2023-08-02T05:00:12Z)
- SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)
- EXnet: Efficient In-context Learning for Data-less Text classification [0.0]
We present EXnet, a model specifically designed to perform in-context learning without limitations on the number of examples.
We argue that in-context learning is an effective method to increase task accuracy, and providing examples facilitates cross-task generalization.
With extensive experiments, we show that even our smallest model (15M parameters) generalizes to several unseen classification tasks and domains.
arXiv Detail & Related papers (2023-05-24T01:40:57Z)
- Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- The Learnability of In-Context Learning [16.182561312622315]
We propose a first-of-its-kind PAC based framework for in-context learnability.
Our framework includes an initial pretraining phase, which fits a function to the pretraining distribution.
We show that in-context learning is more about identifying the task than about learning it.
arXiv Detail & Related papers (2023-03-14T13:28:39Z)
- Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z)
- Coarse-to-Fine: Hierarchical Multi-task Learning for Natural Language Understanding [51.31622274823167]
We propose a hierarchical framework with a coarse-to-fine paradigm, with the bottom level shared across all the tasks, the mid-level divided into different task groups, and the top level assigned to each individual task (a minimal sketch of this layout appears after this list).
This allows our model to learn basic language properties from all tasks, boost performance on relevant tasks, and reduce the negative impact from irrelevant tasks.
arXiv Detail & Related papers (2022-08-19T02:46:20Z)
- Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work? [44.88358841370665]
It is poorly understood when and why intermediate-task training is beneficial for a given target task.
We perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations.
We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best.
arXiv Detail & Related papers (2020-05-01T21:49:34Z)
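As referenced in the Coarse-to-Fine entry above, the hierarchical layout (a bottom encoder shared by all tasks, mid-level modules shared within task groups, and top-level heads per task) can be shown with a short sketch. This is a hedged illustration of the general idea, not the cited paper's code; the task names, group assignments, and layer sizes are invented.

```python
import torch
import torch.nn as nn
from typing import Dict

class CoarseToFineMultiTask(nn.Module):
    def __init__(self, d_in: int, d_hid: int,
                 task_to_group: Dict[str, str], task_out: Dict[str, int]):
        super().__init__()
        self.task_to_group = task_to_group
        # Bottom level: one encoder shared across every task.
        self.shared = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())
        # Mid level: one module per task *group* (e.g., classification vs. tagging).
        self.groups = nn.ModuleDict({g: nn.Sequential(nn.Linear(d_hid, d_hid), nn.ReLU())
                                     for g in set(task_to_group.values())})
        # Top level: one output head per individual task.
        self.heads = nn.ModuleDict({t: nn.Linear(d_hid, n_out) for t, n_out in task_out.items()})

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        h = self.shared(x)                              # learned from all tasks
        h = self.groups[self.task_to_group[task]](h)    # shared within the task's group
        return self.heads[task](h)                      # task-specific prediction

# Toy usage with invented tasks and groups.
model = CoarseToFineMultiTask(
    d_in=16, d_hid=32,
    task_to_group={"sst2": "classification", "mnli": "classification", "ner": "tagging"},
    task_out={"sst2": 2, "mnli": 3, "ner": 9},
)
logits = model(torch.randn(4, 16), task="mnli")         # shape (4, 3)
```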
This list is automatically generated from the titles and abstracts of the papers in this site.