Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective
- URL: http://arxiv.org/abs/2505.14808v1
- Date: Tue, 20 May 2025 18:15:49 GMT
- Title: Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective
- Authors: Soo Min Kwon, Alec S. Xu, Can Yaras, Laura Balzano, Qing Qu
- Abstract summary: We demystify the out-of-distribution capabilities of in-context learning (ICL) by studying linear regression tasks parameterized with low-rank covariance matrices. We prove that a single-layer linear attention model incurs a test risk with a non-negligible dependence on the angle between the training and testing subspaces, illustrating that ICL is not robust to such distribution shifts. This suggests that the OOD generalization ability of Transformers may actually stem from the new task lying within the span of those encountered during training.
- Score: 9.249642973141107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work aims to demystify the out-of-distribution (OOD) capabilities of in-context learning (ICL) by studying linear regression tasks parameterized with low-rank covariance matrices. With such a parameterization, we can model distribution shifts as a varying angle between the subspace of the training and testing covariance matrices. We prove that a single-layer linear attention model incurs a test risk with a non-negligible dependence on the angle, illustrating that ICL is not robust to such distribution shifts. However, using this framework, we also prove an interesting property of ICL: when trained on task vectors drawn from a union of low-dimensional subspaces, ICL can generalize to any subspace within their span, given sufficiently long prompt lengths. This suggests that the OOD generalization ability of Transformers may actually stem from the new task lying within the span of those encountered during training. We empirically show that our results also hold for models such as GPT-2, and conclude with (i) experiments on how our observations extend to nonlinear function classes and (ii) results on how LoRA has the ability to capture distribution shifts.
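The subspace-angle view of distribution shift described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' code: it assumes task vectors are drawn as `w = U z` with `z ~ N(0, I_r)` (so their covariance `U U^T` has rank `r`), and the helper names `rotated_subspace`, `principal_angles`, and `sample_tasks` are hypothetical. It builds a test subspace whose principal angles to the training subspace all equal `theta`, then measures how much of a test task's energy falls outside the training subspace.

```python
import numpy as np


def rotated_subspace(U, V, theta):
    """Tilt each column of U toward the matching column of V by angle theta.

    Assumes U and V have orthonormal columns and U.T @ V == 0, so the
    result again has orthonormal columns.
    """
    return np.cos(theta) * U + np.sin(theta) * V


def principal_angles(A, B):
    """Principal angles (radians) between the column spans of A and B.

    A and B must have orthonormal columns; the singular values of
    A.T @ B are the cosines of the principal angles.
    """
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))


def sample_tasks(U, n, rng):
    """Draw n task vectors w = U z, z ~ N(0, I_r); covariance is U U^T."""
    z = rng.standard_normal((U.shape[1], n))
    return U @ z


rng = np.random.default_rng(0)
d, r, theta = 20, 3, np.pi / 6  # illustrative sizes, not from the paper

# Orthonormal basis with 2r columns: first r span the training subspace,
# the remaining r are orthogonal directions used to tilt it.
Q, _ = np.linalg.qr(rng.standard_normal((d, 2 * r)))
U_train, V = Q[:, :r], Q[:, r:]
U_test = rotated_subspace(U_train, V, theta)

angles = principal_angles(U_train, U_test)

# Fraction of test-task energy lying outside the training subspace.
W_test = sample_tasks(U_test, 1000, rng)
residual = W_test - U_train @ (U_train.T @ W_test)
outside_fraction = np.sum(residual**2) / np.sum(W_test**2)

print("principal angles:", angles)
print("energy outside training subspace:", outside_fraction)
```

Under this construction the out-of-subspace fraction equals `sin(theta)**2`, so it vanishes when the test subspace coincides with the training subspace and grows as the angle increases, which is the kind of shift the abstract says the test risk depends on.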
Related papers
- Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias [0.0]
We argue that failures arise from how models structure their internal representations during training. We show analytically that bilinear parameterizations possess a 'non-mixing' property under gradient flow conditions. Unlike pointwise nonlinear networks, multiplicative architectures are able to recover true operators aligned with the underlying algebraic structure.
arXiv Detail & Related papers (2026-02-05T13:14:01Z) - A Comedy of Estimators: On KL Regularization in RL Training of LLMs [81.7906270099878]
Reinforcement learning (RL) can substantially improve the reasoning performance of large language models (LLMs). The RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for stated objectives, creating a discrepancy between the objective and its implementation. We study the gradients of several estimator configurations, revealing how design choices shape gradient bias.
arXiv Detail & Related papers (2025-12-26T04:20:58Z) - T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning [15.016777234800585]
Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data. Recent studies have highlighted two pivotal properties for effective representations. We introduce T-REGS, a simple regularization framework for SSL based on the length of the Minimum Spanning Tree (MST) over the learned representation.
arXiv Detail & Related papers (2025-10-27T16:16:40Z) - DRL: Discriminative Representation Learning with Parallel Adapters for Class Incremental Learning [63.65467569295623]
We propose the Discriminative Representation Learning (DRL) framework to specifically address these challenges. To conduct incremental learning effectively and yet efficiently, the DRL's network is built upon a PTM. Our DRL consistently outperforms other state-of-the-art methods throughout the entire CIL period.
arXiv Detail & Related papers (2025-10-14T03:19:15Z) - Provable In-Context Vector Arithmetic via Retrieving Task Concepts [53.685764040547625]
We show how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. These results elucidate the advantages of transformers over static embedding predecessors.
arXiv Detail & Related papers (2025-08-13T13:54:44Z) - Bilinear Convolution Decomposition for Causal RL Interpretability [0.0]
Efforts to interpret reinforcement learning (RL) models often rely on high-level techniques such as attribution or probing. This work proposes replacing nonlinearities in convolutional neural networks (ConvNets) with bilinear variants, to produce a class of models for which these limitations can be addressed. We show bilinear model variants perform comparably in model-free reinforcement learning settings, and give a side-by-side comparison on ProcGen environments.
arXiv Detail & Related papers (2024-12-01T19:32:04Z) - Can In-context Learning Really Generalize to Out-of-distribution Tasks? [36.11431280689549]
We investigate the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. We reveal that Transformers may struggle to learn OOD task functions through ICL.
arXiv Detail & Related papers (2024-10-13T02:10:26Z) - Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z) - CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z) - Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly.
The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z) - Implicit variance regularization in non-contrastive SSL [7.573586022424398]
We analytically study learning dynamics in conjunction with Euclidean and cosine similarity in the eigenspace of closed-form linear predictor networks.
We propose a family of isotropic loss functions (IsoLoss) that equalize convergence rates across eigenmodes.
arXiv Detail & Related papers (2022-12-09T13:56:42Z) - Stochastic Mirror Descent in Average Ensemble Models [38.38572705720122]
Stochastic mirror descent (SMD) is a general class of training algorithms that includes the celebrated stochastic gradient descent (SGD) as a special case.
In this paper we explore the performance of the mirror potential algorithm on mean-field ensemble models.
arXiv Detail & Related papers (2022-10-27T11:04:00Z) - Function Classes for Identifiable Nonlinear Independent Component Analysis [10.828616610785524]
Unsupervised learning of latent variable models (LVMs) is widely used to represent data in machine learning.
Recent work suggests that constraining the function class of such models may promote identifiability.
We prove that a subclass of these transformations, conformal maps, is identifiable and provide novel theoretical results.
arXiv Detail & Related papers (2022-08-12T17:58:31Z) - Supervised learning of sheared distributions using linearized optimal transport [64.53761005509386]
In this paper we study supervised learning tasks on the space of probability measures.
We approach this problem by embedding the space of probability measures into $L^2$ spaces using the optimal transport framework.
Regular machine learning techniques are used to achieve linear separability.
arXiv Detail & Related papers (2022-01-25T19:19:59Z) - Meta Learning MDPs with Linear Transition Models [22.508479528847634]
We study meta-learning in Markov Decision Processes (MDP) with linear transition models in the undiscounted episodic setting.
We propose BUC-MatrixRL, a version of the UC-Matrix RL algorithm, and show it can meaningfully leverage a set of sampled training tasks.
We prove that, compared to learning the tasks in isolation, BUC-MatrixRL provides significant improvements in the transfer regret for high-bias, low-variance task distributions.
arXiv Detail & Related papers (2022-01-21T14:57:03Z) - Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z) - Learning Invariant Representations and Risks for Semi-supervised Domain Adaptation [109.73983088432364]
We propose the first method that aims to simultaneously learn invariant representations and risks under the setting of semi-supervised domain adaptation (Semi-DA).
We introduce the LIRR algorithm for jointly Learning Invariant Representations and Risks.
arXiv Detail & Related papers (2020-10-09T15:42:35Z) - FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs [53.710405006523274]
This work focuses on the representation learning question: how can we learn such features?
Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem.
We develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models.
arXiv Detail & Related papers (2020-06-18T19:11:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.