Understanding Emergent In-Context Learning from a Kernel Regression Perspective
- URL: http://arxiv.org/abs/2305.12766v3
- Date: Fri, 12 Sep 2025 22:18:33 GMT
- Title: Understanding Emergent In-Context Learning from a Kernel Regression Perspective
- Authors: Chi Han, Ziqi Wang, Han Zhao, Heng Ji,
- Abstract summary: Large language models (LLMs) have initiated a paradigm shift in transfer learning.<n>This paper proposes a kernel-regression perspective of understanding LLMs' ICL bahaviors when faced with in-context examples.<n>We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
- Score: 55.95455089638838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have initiated a paradigm shift in transfer learning. In contrast to the classic pretraining-then-finetuning procedure, in order to use LLMs for downstream prediction tasks, one only needs to provide a few demonstrations, known as in-context examples, without adding more or updating existing model parameters. This in-context learning (ICL) capability of LLMs is intriguing, and it is not yet fully understood how pretrained LLMs acquire such capabilities. In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training on a general language corpus by proposing a kernel-regression perspective of understanding LLMs' ICL bahaviors when faced with in-context examples. More concretely, we first prove that Bayesian inference on in-context prompts can be asymptotically understood as kernel regression $\hat y = \sum_i y_i K(x, x_i)/\sum_i K(x, x_i)$ as the number of in-context demonstrations grows. Then, we empirically investigate the in-context behaviors of language models. We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression. Finally, our theory provides insights into multiple phenomena observed in the ICL field: why retrieving demonstrative samples similar to test samples can help, why ICL performance is sensitive to the output formats, and why ICL accuracy benefits from selecting in-distribution and representative samples. Code and resources are publicly available at https://github.com/Glaciohound/Explain-ICL-As-Kernel-Regression.
Related papers
- Surprise Calibration for Better In-Context Learning [6.566285172635043]
In-context learning (ICL) has emerged as a powerful paradigm for task adaptation in large language models.<n>Existing bias calibration methods apply fixed class priors across all inputs, limiting their efficacy in dynamic ICL settings.<n>We introduce a novel method-Surprise (SC), which captures the temporal dynamics of class priors.
arXiv Detail & Related papers (2025-06-15T10:04:42Z) - Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning [50.53703102032562]
Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks.<n>The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood.
arXiv Detail & Related papers (2025-05-16T08:50:42Z) - Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization [26.9153121765435]
Large language models (LLMs) have demonstrated remarkable in-context learning abilities.<n>This paper investigates how ICL emerges and the impact of pre-training phase on ICL.<n>Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.
arXiv Detail & Related papers (2025-02-24T10:26:29Z) - Bayesian scaling laws for in-context learning [85.34114399339741]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.<n>We show that ICL approximates a Bayesian learner, which gives rise to a novel Bayesian scaling law for ICL.<n>Our scaling law matches existing scaling laws in accuracy while also offering interpretable terms for task priors, learning efficiency, and per-example probabilities.
arXiv Detail & Related papers (2024-10-21T21:45:22Z) - Larger Language Models Don't Care How You Think: Why Chain-of-Thought Prompting Fails in Subjective Tasks [25.562937159039038]
In-Context Learning (ICL) in Large Language Models (LLM) has emerged as the dominant technique for performing natural language tasks.
We show that ICL relies mostly on the retrieval of task priors and less so on "learning" to perform tasks.
We find that, surprisingly, Chain-of-Thought (CoT) indeed suffers from the same posterior collapse as ICL for larger language models.
arXiv Detail & Related papers (2024-09-10T03:06:17Z) - Verbalized Machine Learning: Revisiting Machine Learning with Language Models [63.10391314749408]
We introduce the framework of verbalized machine learning (VML)
VML constrains the parameter space to be human-interpretable natural language.
We empirically verify the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability.
arXiv Detail & Related papers (2024-06-06T17:59:56Z) - What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages [78.1866280652834]
Large language models (LM) are distributions over strings.
We investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs.
We find that the complexity of the RLM rank is strong and significant predictors of learnability for both RNNs and Transformers.
arXiv Detail & Related papers (2024-06-06T17:34:24Z) - What Do Language Models Learn in Context? The Structured Task Hypothesis [89.65045443150889]
Large language models (LLMs) learn a novel task from in-context examples presented in a demonstration, termed in-context learning (ICL)
One popular hypothesis explains ICL by task selection.
Another popular hypothesis is that ICL is a form of meta-learning, i.e., the models learn a learning algorithm at pre-training time and apply it to the demonstration.
arXiv Detail & Related papers (2024-06-06T16:15:34Z) - Implicit In-context Learning [37.0562059811099]
In-context Learning (ICL) empowers large language models to adapt to unseen tasks during inference by prefixing a few demonstration examples prior to test queries.
We introduce Implicit In-context Learning (I2CL), an innovative paradigm that addresses the challenges associated with traditional ICL by absorbing demonstration examples within the activation space.
I2CL achieves few-shot performance with zero-shot cost and exhibits robustness against the variation of demonstration examples.
arXiv Detail & Related papers (2024-05-23T14:57:52Z) - "In-Context Learning" or: How I learned to stop worrying and love "Applied Information Retrieval" [9.264121218481133]
In-context learning (ICL) has evolved as a new paradigm for natural language processing (NLP)
ICL is conceptually similar to a non-parametric approach, such as $k$-NN.
Similar examples in ICL retrieved from a training set relate to a set of documents retrieved from a collection in IR.
arXiv Detail & Related papers (2024-05-02T09:25:24Z) - Parallel Structures in Pre-training Data Yield In-Context Learning [41.27837171531926]
We study what patterns of the pre-training data contribute to in-context learning (ICL)
We find that LMs' ICL ability depends on $textitparallel structures$ in the pre-training data.
arXiv Detail & Related papers (2024-02-19T20:40:48Z) - In-Context Exemplars as Clues to Retrieving from Large Associative
Memory [1.2952137350423816]
In-context learning (ICL) enables large language models (LLMs) to learn patterns from in-context exemplars without training.
How to choose exemplars remains unclear due to the lack of understanding of how in-context learning works.
Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval.
arXiv Detail & Related papers (2023-11-06T20:13:29Z) - Do pretrained Transformers Learn In-Context by Gradient Descent? [21.23795112800977]
In this paper, we investigate the emergence of In-Context Learning (ICL) in language models pre-trained on natural data (LLaMa-7B)
We find that ICL and Gradient Descent (GD) modify the output distribution of language models differently.
These results indicate that emphthe equivalence between ICL and GD remains an open hypothesis and calls for further studies.
arXiv Detail & Related papers (2023-10-12T17:32:09Z) - Evaluating and Explaining Large Language Models for Code Using Syntactic
Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z) - What and How does In-Context Learning Learn? Bayesian Model Averaging,
Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm.
We prove that the error of pretrained model is bounded by a sum of an approximation error and a generalization error.
arXiv Detail & Related papers (2023-05-30T21:23:47Z) - Label Words are Anchors: An Information Flow Perspective for
Understanding In-Context Learning [77.7070536959126]
In-context learning (ICL) emerges as a promising capability of large language models (LLMs)
In this paper, we investigate the working mechanism of ICL through an information flow lens.
We introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL.
arXiv Detail & Related papers (2023-05-23T15:26:20Z) - A Theory of Emergent In-Context Learning as Implicit Structure Induction [8.17811111226145]
Scaling large language models leads to an emergent capacity to learn in-context from example demonstrations.
We argue that in-context learning relies on recombination of compositional operations found in natural language data.
We show how in-context learning is supported by a representation of the input's compositional structure.
arXiv Detail & Related papers (2023-03-14T15:24:05Z) - ThinkSum: Probabilistic reasoning over sets using large language models [18.123895485602244]
We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner.
We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks.
arXiv Detail & Related papers (2022-10-04T00:34:01Z) - An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution on the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.