How Do Transformers Learn In-Context Beyond Simple Functions? A Case
Study on Learning with Representations
- URL: http://arxiv.org/abs/2310.10616v1
- Date: Mon, 16 Oct 2023 17:40:49 GMT
- Title: How Do Transformers Learn In-Context Beyond Simple Functions? A Case
Study on Learning with Representations
- Authors: Tianyu Guo, Wei Hu, Song Mei, Huan Wang, Caiming Xiong, Silvio
Savarese, Yu Bai
- Abstract summary: This paper takes initial steps toward understanding in-context learning (ICL) in more complex scenarios by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
- Score: 98.7450564309923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models based on the transformer architecture have
demonstrated remarkable in-context learning (ICL) capabilities, understanding
of such capabilities is still at an early stage, where existing theory and
mechanistic understanding focus mostly on simple scenarios such as learning
simple function classes. This paper takes initial steps toward understanding ICL in
more complex scenarios by studying learning with representations. Concretely,
we construct synthetic in-context learning problems with a compositional
structure, where the label depends on the input through a possibly complex but
fixed representation function, composed with a linear function that differs in
each instance. By construction, the optimal ICL algorithm first transforms the
inputs by the representation function, and then performs linear ICL on top of
the transformed dataset. We show theoretically the existence of transformers
that approximately implement such algorithms with mild depth and size.
Empirically, we find trained transformers consistently achieve near-optimal ICL
performance in this setting, and exhibit the desired dissection where lower
layers transform the dataset and upper layers perform linear ICL. Through
extensive probing and a new pasting experiment, we further reveal several
mechanisms within the trained transformers, such as concrete copying behaviors
on both the inputs and the representations, linear ICL capability of the upper
layers alone, and a post-ICL representation selection mechanism in a harder
mixture setting. These observed mechanisms align well with our theory and may
shed light on how transformers perform ICL in more realistic scenarios.
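To make the construction concrete, here is a minimal NumPy sketch under assumed design choices (the ReLU random-feature form of the representation phi, the dimensions, and the ridge regularizer lam are illustrative, not the paper's exact setup): a fixed representation function shared across instances, a fresh linear head per instance, and a two-stage baseline that first transforms the demonstrations by the representation and then performs linear (ridge) ICL on top, in the spirit of the "optimal ICL algorithm" described above.
```python
import numpy as np

# Illustrative dimensions: input dim d, representation dim D,
# number of in-context demonstrations n_context (arbitrary choices).
rng = np.random.default_rng(0)
d, D, n_context = 8, 16, 32

# Fixed (shared across all instances) representation function: here a
# one-layer ReLU network with frozen random weights -- an assumption
# for illustration, not the paper's exact choice.
W1 = rng.normal(size=(D, d)) / np.sqrt(d)

def phi(x):
    return np.maximum(W1 @ x, 0.0)

def sample_instance():
    # Each ICL instance draws a fresh linear head w on top of the shared phi.
    w = rng.normal(size=D) / np.sqrt(D)
    X = rng.normal(size=(n_context + 1, d))
    y = np.array([w @ phi(x) for x in X])
    # First n_context pairs are demonstrations; the last input is the query.
    return X[:-1], y[:-1], X[-1], y[-1]

def two_stage_icl_predict(X_demo, y_demo, x_query, lam=1e-3):
    # Two-stage baseline: transform the demonstrations by phi, then do
    # ridge regression (linear ICL) on the transformed dataset.
    Phi = np.stack([phi(x) for x in X_demo])
    w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y_demo)
    return w_hat @ phi(x_query)

X_demo, y_demo, x_query, y_query = sample_instance()
print(two_stage_icl_predict(X_demo, y_demo, x_query), y_query)
```
The trained transformers studied in the paper are compared against this kind of two-stage procedure; the sketch is only meant to fix intuition for the data-generating process and the target algorithm.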
Related papers
- Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities.
This work provides a fine mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z)
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data (an illustrative data sketch follows this entry).
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
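For readers unfamiliar with this data model, the following is a rough, purely illustrative sketch of $n$-gram Markov chain data for ICL (vocabulary size, order, and sequence length are arbitrary choices, not the referenced paper's settings): each sequence is drawn from a random order-(n-1) Markov chain whose transition table is resampled per sequence, so the chain's statistics must be inferred from the context.
```python
import numpy as np

# Arbitrary illustrative settings: vocabulary size, n-gram order, sequence length.
rng = np.random.default_rng(0)
vocab, n, seq_len = 4, 3, 64

def sample_ngram_sequence():
    # Fresh random transition table P(next token | previous n-1 tokens),
    # drawn per sequence, so the model must infer it in context.
    probs = rng.dirichlet(np.ones(vocab), size=(vocab,) * (n - 1))
    seq = list(rng.integers(vocab, size=n - 1))
    while len(seq) < seq_len:
        ctx = tuple(seq[-(n - 1):])
        seq.append(int(rng.choice(vocab, p=probs[ctx])))
    return np.array(seq)

print(sample_ngram_sequence())
```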
- How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models have displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze how the mechanics by which Transformers achieve ICL contribute to the technical challenges of analyzing their training.
arXiv Detail & Related papers (2024-02-23T21:07:20Z)
- Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings [60.698130703909804]
Transformers generalize to novel compositions of structures and entities after being trained on a complex dataset.
We propose SQ-Transformer that explicitly encourages systematicity in the embeddings and attention layers.
We show that SQ-Transformer achieves stronger compositional generalization than the vanilla Transformer on multiple low-complexity semantic parsing and machine translation datasets.
arXiv Detail & Related papers (2024-02-09T15:53:15Z)
- Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data [21.242708937367865]
Large language models (LLMs) are powerful models that can learn concepts at the inference stage via in-context learning (ICL).
This paper studies the role of each component in the transformer architecture and provides a theoretical understanding to explain the success of the architecture.
arXiv Detail & Related papers (2024-02-01T16:39:45Z)
- Positional Information Matters for Invariant In-Context Learning: A Case Study of Simple Function Classes [39.08988313527199]
In-context learning (ICL) refers to the ability of a model to condition on a few in-context demonstrations to generate the answer for a new query input.
Despite the impressive ICL ability of LLMs, ICL in LLMs is sensitive to input demonstrations and limited to short context lengths.
arXiv Detail & Related papers (2023-11-30T02:26:55Z)
- Schema-learning and rebinding as mechanisms of in-context learning and emergence [10.370506005311091]
In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs).
We demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method using clone-structured causal graphs (CSCGs).
arXiv Detail & Related papers (2023-06-16T00:29:19Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms (a classical analogue of such selection is sketched after this entry).
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
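The in-context algorithm selection idea can be pictured with a classical analogue; the sketch below is a hedged illustration rather than the transformer construction from the referenced paper. It validates several base algorithms (here, ridge regression with different regularization strengths, an arbitrary choice) on a held-out slice of the in-context demonstrations and predicts with the best one.
```python
import numpy as np

def ridge_fit_predict(X_fit, y_fit, X_eval, lam):
    # Base algorithm: ridge regression with regularization strength lam.
    d = X_fit.shape[1]
    w = np.linalg.solve(X_fit.T @ X_fit + lam * np.eye(d), X_fit.T @ y_fit)
    return X_eval @ w

def select_and_predict(X_demo, y_demo, x_query, lams=(1e-3, 1e-1, 1.0)):
    # Post-hoc algorithm selection: validate each base algorithm on the
    # second half of the demonstrations, then refit the winner on all of them.
    k = len(X_demo) // 2
    X_f, y_f, X_v, y_v = X_demo[:k], y_demo[:k], X_demo[k:], y_demo[k:]
    errs = [np.mean((ridge_fit_predict(X_f, y_f, X_v, lam) - y_v) ** 2)
            for lam in lams]
    best_lam = lams[int(np.argmin(errs))]
    return ridge_fit_predict(X_demo, y_demo, x_query[None, :], best_lam)[0]

# Example usage with random linear data (purely illustrative):
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))
w_true = rng.normal(size=5)
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=20)
x_query = rng.normal(size=5)
print(select_and_predict(X_demo, y_demo, x_query), x_query @ w_true)
```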