An Explanation of In-context Learning as Implicit Bayesian Inference
- URL: http://arxiv.org/abs/2111.02080v1
- Date: Wed, 3 Nov 2021 09:12:33 GMT
- Title: An Explanation of In-context Learning as Implicit Bayesian Inference
- Authors: Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma
- Abstract summary: We study the role of the pretraining distribution on the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
- Score: 117.19809377740188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pretrained language models such as GPT-3 have the surprising ability to
do in-context learning, where the model learns to do a downstream task simply
by conditioning on a prompt consisting of input-output examples. Without being
explicitly pretrained to do so, the language model learns from these examples
during its forward pass, with no parameter updates, even though the prompts are
"out-of-distribution" with respect to the pretraining data. It is thus unclear
what mechanism enables in-context learning. In
this paper, we study the role of the pretraining distribution on the emergence
of in-context learning under a mathematical setting where the pretraining texts
have long-range coherence. Here, language model pretraining requires inferring
a latent document-level concept from the conditioning text to generate coherent
next tokens. At test time, this mechanism enables in-context learning by
inferring the shared latent concept between prompt examples and applying it to
make a prediction on the test example. Concretely, we prove that in-context
learning occurs implicitly via Bayesian inference of the latent concept when
the pretraining distribution is a mixture of HMMs. This can occur despite the
distribution mismatch between prompts and pretraining data. In contrast to
messy large-scale pretraining datasets for in-context learning in natural
language, we generate a family of small-scale synthetic datasets (GINC) where
Transformer and LSTM language models both exhibit in-context learning. Beyond
the theory which focuses on the effect of the pretraining distribution, we
empirically find that scaling model size improves in-context accuracy even when
the pretraining loss is the same.
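
As a concrete illustration of the Bayesian-inference view, the sketch below is a toy simplification, not the paper's mixture-of-HMMs construction: it treats in-context prediction as posterior inference over a small set of hypothetical latent concepts, where conditioning on prompt examples concentrates the posterior on the concept the examples share, and the test prediction is the resulting posterior-predictive. The concept names, label probabilities, and prompt values are illustrative assumptions, not quantities from the paper.

```python
# Toy sketch: in-context prediction as implicit Bayesian inference over a
# latent "concept" theta (assumed discrete here for simplicity).
import numpy as np

# Hypothetical concepts: each maps 4 input symbols to p(y=1 | x, theta).
concepts = {
    "copy":   np.array([0.9, 0.9, 0.1, 0.1]),
    "invert": np.array([0.1, 0.1, 0.9, 0.9]),
    "noisy":  np.array([0.5, 0.5, 0.5, 0.5]),
}
prior = {name: 1.0 / len(concepts) for name in concepts}

def posterior_over_concepts(prompt):
    """p(theta | prompt) via Bayes' rule, assuming i.i.d. (x, y) prompt examples."""
    log_post = {name: np.log(prior[name]) for name in concepts}
    for x, y in prompt:
        for name, p1 in concepts.items():
            log_post[name] += np.log(p1[x] if y == 1 else 1.0 - p1[x])
    z = np.logaddexp.reduce(list(log_post.values()))  # normalizer
    return {name: np.exp(lp - z) for name, lp in log_post.items()}

def predict(prompt, x_test):
    """Posterior-predictive: sum_theta p(y=1 | x_test, theta) * p(theta | prompt)."""
    post = posterior_over_concepts(prompt)
    return sum(post[name] * concepts[name][x_test] for name in concepts)

# Prompt examples consistent with the "copy" concept.
prompt = [(0, 1), (1, 1), (2, 0), (3, 0)]
print(posterior_over_concepts(prompt))  # posterior concentrates on "copy"
print(predict(prompt, x_test=0))        # prediction sharpens toward p(y=1) ~ 0.9
```

As more prompt examples are drawn from one concept, the posterior concentrates on it, mirroring the paper's argument that in-context prediction improves as the shared latent concept is inferred more sharply.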
Related papers
- Toward Understanding In-context vs. In-weight Learning [50.24035812301655]
We identify simplified distributional properties that give rise to the emergence and disappearance of in-context learning.
We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
arXiv Detail & Related papers (2024-10-30T14:09:00Z) - The mechanistic basis of data dependence and abrupt learning in an in-context classification task [0.3626013617212666]
We show that specific distributional properties inherent in language control the trade-off or simultaneous appearance of two forms of learning.
In-context learning is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning.
We propose that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL.
arXiv Detail & Related papers (2023-12-03T20:53:41Z) - SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z) - Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z) - Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or a given attribute.
We propose a novel search strategy based on greedy search to identify near-optimal prompts that improve the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z) - A Theory of Emergent In-Context Learning as Implicit Structure Induction [8.17811111226145]
Scaling large language models leads to an emergent capacity to learn in-context from example demonstrations.
We argue that in-context learning relies on recombination of compositional operations found in natural language data.
We show how in-context learning is supported by a representation of the input's compositional structure.
arXiv Detail & Related papers (2023-03-14T15:24:05Z) - The Learnability of In-Context Learning [16.182561312622315]
We propose a first-of-its-kind PAC-based framework for in-context learnability.
Our framework includes an initial pretraining phase, which fits a function to the pretraining distribution.
We show that in-context learning is more about identifying the task than about learning it.
arXiv Detail & Related papers (2023-03-14T13:28:39Z) - The Inductive Bias of In-Context Learning: Rethinking Pretraining Example Design [34.900425311720795]
We show that a pretrained NLM models stronger dependencies between text segments that appeared in the same training example than between text segments that appeared in different training examples.
We propose "kNN-Pretraining" and show that including semantically related non-neighboring sentences in the same pretraining example yields improved sentence representations and open-domain question answering abilities.
arXiv Detail & Related papers (2021-10-09T11:05:16Z) - How Context Affects Language Models' Factual Predictions [134.29166998377187]
We integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way.
We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline.
arXiv Detail & Related papers (2020-05-10T09:28:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.