Understanding In-context Learning of Addition via Activation Subspaces
- URL: http://arxiv.org/abs/2505.05145v3
- Date: Thu, 09 Oct 2025 17:58:05 GMT
- Title: Understanding In-context Learning of Addition via Activation Subspaces
- Authors: Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen
- Abstract summary: We study a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We then perform an in-depth analysis of individual heads, via dimensionality reduction and decomposition. Our results demonstrate how tracking low-dimensional subspaces of localized heads across a forward pass can provide insight into fine-grained computational structures in language models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To perform few-shot learning, language models extract signals from a few input-label pairs, aggregate these into a learned prediction rule, and apply this rule to new inputs. How is this implemented in the forward pass of modern transformer models? To explore this question, we study a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We introduce a novel optimization method that localizes the model's few-shot ability to only a few attention heads. We then perform an in-depth analysis of individual heads, via dimensionality reduction and decomposition. As an example, on Llama-3-8B-instruct, we reduce its mechanism on our tasks to just three attention heads with six-dimensional subspaces, where four dimensions track the unit digit with trigonometric functions at periods $2$, $5$, and $10$, and two dimensions track magnitude with low-frequency components. To deepen our understanding of the mechanism, we also derive a mathematical identity relating ``aggregation'' and ``extraction'' subspaces for attention heads, allowing us to track the flow of information from individual examples to a final aggregated concept. Using this, we identify a self-correction mechanism where mistakes learned from earlier demonstrations are suppressed by later demonstrations. Our results demonstrate how tracking low-dimensional subspaces of localized heads across a forward pass can provide insight into fine-grained computational structures in language models.
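The trigonometric coding described in the abstract can be illustrated with a short sketch. This is a hedged reconstruction, not the paper's actual code: the specific choice of cosine/sine components at periods 2, 5, and 10 is an assumption, made only to show that a handful of such features suffices to distinguish all ten unit digits.

```python
import math

def unit_digit_features(n: int) -> tuple:
    """Trigonometric features of the unit digit d = n % 10 at
    periods 2, 5, and 10 (illustrative; the paper's exact basis
    may differ)."""
    d = n % 10
    return (
        math.cos(2 * math.pi * d / 2),   # period 2: tracks parity
        math.cos(2 * math.pi * d / 5),   # period 5 pair
        math.sin(2 * math.pi * d / 5),
        math.cos(2 * math.pi * d / 10),  # period 10 pair
        math.sin(2 * math.pi * d / 10),
    )

# The features depend only on the unit digit, and distinct digits
# receive distinct feature vectors, so the digit can be decoded
# from this low-dimensional representation alone.
vectors = [tuple(round(x, 9) for x in unit_digit_features(d)) for d in range(10)]
assert len(set(vectors)) == 10
assert unit_digit_features(17) == unit_digit_features(7)
```

The period-10 pair already places the ten digits at distinct points on a circle; the period-2 and period-5 components add redundant structure of the kind the paper reports in the heads' subspaces.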
Related papers
- Learning Without Training [0.0]
This dissertation comprises three projects rooted in mathematical theory for machine learning applications. The first deals with supervised learning and manifold learning. The second deals with transfer learning: the study of how an approximation process or model learned on one domain can be leveraged to improve approximation on another. The third concerns the classification task in machine learning, particularly in the active learning paradigm.
arXiv Detail & Related papers (2026-02-20T04:42:06Z) - ELROND: Exploring and decomposing intrinsic capabilities of diffusion models [3.656403721249365]
A single text prompt passed to a diffusion model often yields a wide range of visual outputs, determined solely by the sampling process. We propose a framework to disentangle the underlying semantic directions directly within the input embedding.
arXiv Detail & Related papers (2026-02-10T19:07:15Z) - Attention Layers Add Into Low-Dimensional Residual Subspaces [46.25442191251545]
We show that attention outputs are confined to a surprisingly low-dimensional subspace. We find that this low-rank structure is a key factor in the prevalent dead-feature problem in sparse dictionary learning.
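A claim of this kind is typically checked with a standard diagnostic: center the activation vectors and measure how much variance their top singular directions capture. The sketch below is illustrative only; it synthesizes low-rank data plus noise rather than collecting real attention outputs from a model.

```python
import numpy as np

# Synthetic "activations": rank-5 signal in a 64-dim space plus small noise.
rng = np.random.default_rng(0)
d_model, rank, n = 64, 5, 1000
basis = rng.normal(size=(rank, d_model))
coeffs = rng.normal(size=(n, rank))
acts = coeffs @ basis + 0.01 * rng.normal(size=(n, d_model))

acts = acts - acts.mean(axis=0)            # center before SVD
s = np.linalg.svd(acts, compute_uv=False)  # singular values, descending
var = s**2 / np.sum(s**2)                  # variance fraction per direction
print(f"variance captured by top {rank} directions: {var[:rank].sum():.4f}")
```

If attention outputs were similarly confined, the same computation on real activations would show the variance concentrating in far fewer directions than the model dimension.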
arXiv Detail & Related papers (2025-08-23T07:27:00Z) - A Markov Categorical Framework for Language Modeling [9.910562011343009]
Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes their representations, and how this enables complex behaviors remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages, using the language of Markov categories. This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.
arXiv Detail & Related papers (2025-07-25T13:14:03Z) - Talking Heads: Understanding Inter-layer Communication in Transformer Language Models [32.2976613483151]
We analyze a mechanism used in two LMs to selectively inhibit items in the context on one task. We find that models write into low-rank subspaces of the residual stream to represent features, which are then read out by later layers.
arXiv Detail & Related papers (2024-06-13T18:12:01Z) - Arithmetic in Transformers Explained [1.8434042562191815]
We analyze 44 autoregressive transformer models trained on addition, subtraction, or both. We show that the addition models converge on a common logical algorithm, with most models achieving >99.999% prediction accuracy. We introduce a reusable library of mechanistic interpretability tools to define, locate, and visualize these algorithmic circuits.
arXiv Detail & Related papers (2024-02-04T21:33:18Z) - Ticketed Learning-Unlearning Schemes [57.89421552780526]
We propose a new ticketed model for learning-unlearning.
We provide space-efficient ticketed learning-unlearning schemes for a broad family of concept classes.
arXiv Detail & Related papers (2023-06-27T18:54:40Z) - Semantic Prompt for Few-Shot Image Recognition [76.68959583129335]
We propose a novel Semantic Prompt (SP) approach for few-shot learning.
The proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.
arXiv Detail & Related papers (2023-03-24T16:32:19Z) - Gaussian Switch Sampling: A Second Order Approach to Active Learning [11.775252660867285]
In active learning, acquisition functions define informativeness directly on the representation position within the model manifold.
We propose a grounded second-order definition of information content and sample importance within the context of active learning.
We show that our definition produces highly accurate importance scores even when the model representations are constrained by the lack of training data.
arXiv Detail & Related papers (2023-02-16T15:24:56Z) - Project and Probe: Sample-Efficient Domain Adaptation by Interpolating
Orthogonal Features [119.22672589020394]
We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features.
Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro$^2$ improves performance by 5-15% when given limited target data.
arXiv Detail & Related papers (2023-02-10T18:58:03Z) - ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z) - Understanding Self-Predictive Learning for Reinforcement Learning [61.62067048348786]
We study the learning dynamics of self-predictive learning for reinforcement learning.
We propose a novel self-predictive algorithm that learns two representations simultaneously.
arXiv Detail & Related papers (2022-12-06T20:43:37Z) - What learning algorithm is in-context learning? Investigations with
linear models [87.91612418166464]
We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly.
We show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression.
We provide preliminary evidence that in-context learners share algorithmic features with these predictors.
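The reference predictors named here are standard closed-form estimators. As a small grounded sketch (synthetic data, not the paper's experimental setup), the following shows why ridge regression and exact least squares serve as closely related baselines: as the ridge penalty vanishes, the two solutions coincide on a well-conditioned problem.

```python
import numpy as np

# Synthetic linear-regression data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=20)

# Exact least-squares solution.
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regression: w = (X^T X + lam I)^{-1} X^T y.
# As lam -> 0, this approaches the least-squares solution.
lam = 1e-8
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

gap = float(np.max(np.abs(w_ls - w_ridge)))
assert gap < 1e-6
```

Comparing a transformer's in-context predictions against such predictors is what lets one test whether the model implicitly implements a standard learning algorithm.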
arXiv Detail & Related papers (2022-11-28T18:59:51Z) - Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z) - Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
Little has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z) - Deep Learning Through the Lens of Example Difficulty [21.522182447513632]
We introduce a measure of the computational difficulty of making a prediction for a given input: the (effective) prediction depth.
Our investigation reveals surprising yet simple relationships between the prediction depth of a given input and the model's uncertainty, confidence, accuracy and speed of learning for that data point.
arXiv Detail & Related papers (2021-06-17T16:48:12Z) - A Minimalist Dataset for Systematic Generalization of Perception,
Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z) - Learning What Makes a Difference from Counterfactual Examples and
Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.