Related papers: Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective

URL: http://arxiv.org/abs/2505.14808v1
Date: Tue, 20 May 2025 18:15:49 GMT
Title: Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective
Authors: Soo Min Kwon, Alec S. Xu, Can Yaras, Laura Balzano, Qing Qu
Abstract summary: We demystify the out-of-distribution capabilities of in-context learning (ICL) by studying linear regression tasks parameterized with low-rank covariance matrices. We prove that a single-layer linear attention model incurs a test risk with a non-negligible dependence on the angle between the training and testing subspaces, illustrating that ICL is not robust to such distribution shifts. This suggests that the OOD generalization ability of Transformers may actually stem from the new task lying within the span of those encountered during training.
Score: 9.249642973141107
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This work aims to demystify the out-of-distribution (OOD) capabilities of in-context learning (ICL) by studying linear regression tasks parameterized with low-rank covariance matrices. With such a parameterization, we can model distribution shifts as a varying angle between the subspace of the training and testing covariance matrices. We prove that a single-layer linear attention model incurs a test risk with a non-negligible dependence on the angle, illustrating that ICL is not robust to such distribution shifts. However, using this framework, we also prove an interesting property of ICL: when trained on task vectors drawn from a union of low-dimensional subspaces, ICL can generalize to any subspace within their span, given sufficiently long prompt lengths. This suggests that the OOD generalization ability of Transformers may actually stem from the new task lying within the span of those encountered during training. We empirically show that our results also hold for models such as GPT-2, and conclude with (i) experiments on how our observations extend to nonlinear function classes and (ii) results on how LoRA has the ability to capture distribution shifts.
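The geometry behind "distribution shift as an angle between subspaces" can be illustrated with a minimal sketch. It is not the paper's model or its risk bound. It assumes a toy setting in R^2 where the training subspace is a single line, the test task is a unit vector on a line rotated by some angle, and all function names are hypothetical. It only shows that the part of the test task lying outside the training subspace has squared norm sin^2 of that angle.

```python
import math

def unit_vector(angle):
    """Unit vector in R^2 spanning a one-dimensional subspace at `angle` radians."""
    return (math.cos(angle), math.sin(angle))

def principal_angle(u, v):
    """Principal angle in [0, pi/2] between the lines spanned by unit vectors u and v."""
    dot = abs(u[0] * v[0] + u[1] * v[1])
    return math.acos(min(1.0, dot))  # clamp guards against rounding above 1

def residual_energy(w, u):
    """Squared norm of the component of w outside span(u), with u a unit vector."""
    coeff = w[0] * u[0] + w[1] * u[1]
    rx, ry = w[0] - coeff * u[0], w[1] - coeff * u[1]
    return rx * rx + ry * ry

# Assumed toy setup: training tasks lie on the x-axis, test tasks on a rotated line.
train_dir = unit_vector(0.0)
for degrees in (0, 30, 60, 90):
    test_task = unit_vector(math.radians(degrees))
    angle = principal_angle(train_dir, test_task)
    lost = residual_energy(test_task, train_dir)
    print(f"{degrees:>2} deg: angle={math.degrees(angle):5.1f}, outside-subspace energy={lost:.3f}")
```

In this toy picture, a test task at angle 0 lies entirely inside the training subspace, while one at 90 degrees lies entirely outside it. For higher-dimensional subspaces, principal angles would normally be computed from an SVD rather than a single dot product.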

Related papers

Bilinear Convolution Decomposition for Causal RL Interpretability [0.0]
Efforts to interpret reinforcement learning (RL) models often rely on high-level techniques such as attribution or probing. This work proposes replacing nonlinearities in convolutional neural networks (ConvNets) with bilinear variants, to produce a class of models for which these limitations can be addressed. We show bilinear model variants perform comparably in model-free reinforcement learning settings, and give a side-by-side comparison on ProcGen environments.
arXiv Detail & Related papers (2024-12-01T19:32:04Z)
Can In-context Learning Really Generalize to Out-of-distribution Tasks? [36.11431280689549]
We investigate the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. We reveal that Transformers may struggle to learn OOD task functions through ICL.
arXiv Detail & Related papers (2024-10-13T02:10:26Z)
Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training. This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z)
CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances. We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data. Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly. The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z)
Implicit variance regularization in non-contrastive SSL [7.573586022424398]
We analytically study learning dynamics in conjunction with Euclidean and cosine similarity in the eigenspace of closed-form linear predictor networks. We propose a family of isotropic loss functions (IsoLoss) that equalize convergence rates across eigenmodes.
arXiv Detail & Related papers (2022-12-09T13:56:42Z)
Stochastic Mirror Descent in Average Ensemble Models [38.38572705720122]
Stochastic mirror descent (SMD) is a general class of training algorithms, which includes the celebrated stochastic gradient descent (SGD) as a special case. In this paper we explore the performance of the mirror potential algorithm on mean-field ensemble models.
arXiv Detail & Related papers (2022-10-27T11:04:00Z)
Function Classes for Identifiable Nonlinear Independent Component Analysis [10.828616610785524]
Unsupervised learning of latent variable models (LVMs) is widely used to represent data in machine learning. Recent work suggests that constraining the function class of such models may promote identifiability. We prove that a subclass of these transformations, conformal maps, is identifiable and provide novel theoretical results.
arXiv Detail & Related papers (2022-08-12T17:58:31Z)
Supervised learning of sheared distributions using linearized optimal transport [64.53761005509386]
In this paper we study supervised learning tasks on the space of probability measures. We approach this problem by embedding the space of probability measures into $L^2$ spaces using the optimal transport framework. Regular machine learning techniques are used to achieve linear separability.
arXiv Detail & Related papers (2022-01-25T19:19:59Z)
Meta Learning MDPs with Linear Transition Models [22.508479528847634]
We study meta-learning in Markov Decision Processes (MDP) with linear transition models in the undiscounted episodic setting. We propose BUC-MatrixRL, a version of the UC-Matrix RL algorithm, and show it can meaningfully leverage a set of sampled training tasks. We prove that compared to learning the tasks in isolation, BUC-MatrixRL provides significant improvements in the transfer regret for high bias low variance task distributions.
arXiv Detail & Related papers (2022-01-21T14:57:03Z)
Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues. We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders. We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
Learning Invariant Representations and Risks for Semi-supervised Domain Adaptation [109.73983088432364]
We propose the first method that aims to simultaneously learn invariant representations and risks under the setting of semi-supervised domain adaptation (Semi-DA). We introduce the LIRR algorithm for jointly Learning Invariant Representations and Risks.
arXiv Detail & Related papers (2020-10-09T15:42:35Z)
FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs [53.710405006523274]
This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. We develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models.
arXiv Detail & Related papers (2020-06-18T19:11:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.