Related papers: Transformer learns the cross-task prior and regularization for in-context learning

Transformer learns the cross-task prior and regularization for in-context learning

URL: http://arxiv.org/abs/2505.12138v1
Date: Sat, 17 May 2025 20:42:23 GMT
Title: Transformer learns the cross-task prior and regularization for in-context learning
Authors: Fei Lu, Yue Yu,
Abstract summary: We introduce a linear transformer to learn the inverse mapping from contextual examples to the underlying weight vector.<n>We numerically verify that the error of the transformer estimator scales linearly with the noise level, the ratio of task dimension to context length, and the condition number of the input data.
Score: 11.570071580371964
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Transformers have shown a remarkable ability for in-context learning (ICL), making predictions based on contextual examples. However, while theoretical analyses have explored this prediction capability, the nature of the inferred context and its utility for downstream predictions remain open questions. This paper aims to address these questions by examining ICL for inverse linear regression (ILR), where context inference can be characterized by unsupervised learning of underlying weight vectors. Focusing on the challenging scenario of rank-deficient inverse problems, where context length is smaller than the number of unknowns in the weight vectors and regularization is necessary, we introduce a linear transformer to learn the inverse mapping from contextual examples to the underlying weight vector. Our findings reveal that the transformer implicitly learns both a prior distribution and an effective regularization strategy, outperforming traditional ridge regression and regularization methods. A key insight is the necessity of low task dimensionality relative to the context length for successful learning. Furthermore, we numerically verify that the error of the transformer estimator scales linearly with the noise level, the ratio of task dimension to context length, and the condition number of the input data. These results not only demonstrate the potential of transformers for solving ill-posed inverse problems, but also provide a new perspective towards understanding the knowledge extraction mechanism within transformers.

Related papers

When Does Context Help? Error Dynamics of Contextual Information in Large Language Models [64.88201012057822]
We present a unified theoretical framework for analyzing the effect of arbitrary contextual information in large language models.<n>Our analysis characterizes contextual influence through output error dynamics.<n> Experiments across ICL, retrieval-augmented generation, and memory evolution validate our theory and motivate a principled context selection strategy.
arXiv Detail & Related papers (2026-02-09T05:58:41Z)
Quantitative Bounds for Length Generalization in Transformers [58.175107357008876]
We study the problem of length generalization (LG) in transformers.<n>LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences.
arXiv Detail & Related papers (2025-10-30T21:31:36Z)
Provable In-Context Learning of Nonlinear Regression with Transformers [58.018629320233174]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters.<n>Recent research has actively explored the training dynamics behind ICL.<n>This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z)
Transformers Don't In-Context Learn Least Squares Regression [5.648229654902264]
In-context learning (ICL) has emerged as a powerful capability of large pretrained transformers.<n>We study how transformers implement learning at inference time.<n>We highlight the role of the pretraining corpus in shaping ICL behaviour.
arXiv Detail & Related papers (2025-07-13T01:09:26Z)
Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights. This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task. We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
Linear Transformers are Versatile In-Context Learners [19.988368693379087]
We prove that each layer of a linear transformer maintains a weight vector for an implicit linear regression problem. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm.
arXiv Detail & Related papers (2024-02-21T23:45:57Z)
In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent. For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems. We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action. We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z)
Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks as transformers have demonstrated a remarkable ability to exhibit inattention learning (ICL) We show that when transformer training over random instances of linear regression problems, these models' predictions mimic nonlinear of ordinary squares.
arXiv Detail & Related papers (2023-06-16T15:50:03Z)
Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning [23.677503557659705]
In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of examples and performs inference on-the-fly. We treat the transformer model as a learning algorithm that can be specialized via training to implement-at inference-time-another target algorithm. We show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
arXiv Detail & Related papers (2023-01-17T18:31:12Z)
Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations. We show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.