In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly
- URL: http://arxiv.org/abs/2506.19351v1
- Date: Tue, 24 Jun 2025 06:33:00 GMT
- Title: In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly
- Authors: Puneesh Deora, Bhavya Vasudeva, Tina Behnia, Christos Thrampoulidis
- Abstract summary: In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones.
- Score: 25.47694115798524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. While existing research has typically studied ICL in fixed-complexity environments, practical language models encounter tasks spanning diverse complexity levels. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones. We design well-controlled testbeds based on Markov chains and linear regression that reveal transformers not only identify the appropriate complexity level for each task but also accurately infer the corresponding parameters, even when the in-context examples are compatible with multiple complexity hypotheses. Notably, when presented with data generated by simpler processes, transformers consistently favor the least complex sufficient explanation. We theoretically explain this behavior through a Bayesian framework, demonstrating that transformers effectively implement an in-context Bayesian Occam's razor by balancing model fit against complexity penalties. We further ablate the roles of model size, training mixture distribution, inference context length, and architecture. Finally, we validate this Occam's razor-like inductive bias on a pretrained GPT-4 model with Boolean-function tasks as a case study, suggesting it may be inherent to transformers trained on diverse task distributions.
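The Bayesian Occam's razor the abstract invokes can be illustrated with a toy version of the Markov-chain testbed. The sketch below is an assumed reconstruction for illustration only, not the authors' code: it scores an order-0 (i.i.d.) hypothesis against an order-1 Markov hypothesis by their Dirichlet-multinomial marginal likelihoods on data drawn from the simpler process. Although the order-1 model can represent the data perfectly, its extra parameters lower its evidence, so the least complex sufficient explanation wins.

```python
# Illustrative sketch (assumed setup, not the paper's code): Bayesian evidence
# comparison between an order-0 (i.i.d.) and an order-1 Markov hypothesis on a
# binary sequence generated by the simpler, order-0 process.
import numpy as np
from scipy.special import gammaln

def log_dirichlet_multinomial(counts, alpha=1.0):
    """Log marginal likelihood of count data under a symmetric Dirichlet prior."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    return (gammaln(k * alpha) - gammaln(k * alpha + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

rng = np.random.default_rng(0)
seq = rng.integers(0, 2, size=200)                 # i.i.d. (order-0) binary data

# Score both hypotheses on the 199 transitions, conditioning on the first symbol.
order0_counts = np.bincount(seq[1:], minlength=2)  # one categorical for all positions
trans_counts = np.zeros((2, 2))
for prev, nxt in zip(seq[:-1], seq[1:]):           # one categorical per previous symbol
    trans_counts[prev, nxt] += 1

log_ev_order0 = log_dirichlet_multinomial(order0_counts)
log_ev_order1 = sum(log_dirichlet_multinomial(row) for row in trans_counts)

print(f"log evidence, order-0 hypothesis: {log_ev_order0:.2f}")
print(f"log evidence, order-1 hypothesis: {log_ev_order1:.2f}")
# The richer order-1 model is typically penalized for its unused capacity,
# mirroring the in-context preference for simpler hypotheses described above.
```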
Related papers
- Sample Complexity and Representation Ability of Test-time Scaling Paradigms [91.34339030453425]
Test-time scaling paradigms have advanced the capabilities of large language models (LLMs) on complex tasks. We study the sample efficiency of various test-time strategies, such as self-consistency, best-of-$n$, and self-correction. A single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query.
arXiv Detail & Related papers (2025-06-05T17:48:19Z)
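As a quick illustration of two of the test-time strategies named in the entry above, the following generic sketch shows self-consistency (majority voting) and best-of-$n$ selection; `sample_answer` and `score` are hypothetical stand-ins for a model call and a verifier, not APIs from the paper.

```python
# Generic sketch of two test-time scaling strategies; `sample_answer` and
# `score` are hypothetical placeholders, nothing here comes from the paper.
import random
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n: int) -> str:
    """Sample n answers independently and return the majority answer."""
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(sample_answer: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Sample n answers and return the one the scorer ranks highest."""
    return max((sample_answer() for _ in range(n)), key=score)

# Toy usage: a "model" that answers "4" with probability 0.7 and "5" otherwise.
random.seed(0)
noisy_model = lambda: "4" if random.random() < 0.7 else "5"
print(self_consistency(noisy_model, n=11))  # majority voting typically recovers "4"
```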
- Learning Compositional Functions with Transformers from Easy-to-Hard Data [63.96562216704653]
We study the learnability of the $k$-fold composition task, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations. We show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer.
arXiv Detail & Related papers (2025-05-29T17:22:00Z)
- Context-Scaling versus Task-Scaling in In-Context Learning [17.36757113301424]
We analyze two key components of In-Context Learning (ICL): context-scaling and task-scaling.
While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling.
arXiv Detail & Related papers (2024-10-16T17:58:08Z)
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- Divide et Impera: Multi-Transformer Architectures for Complex NLP-Tasks [44.99833362998488]
We present an approach in which complex tasks are divided into simpler subtasks.
Multiple transformer models are fine-tuned to one subtask each, and lined up to accomplish the complex task.
This simplifies the compilation of fine-tuning datasets and increases overall controllability.
arXiv Detail & Related papers (2023-10-25T18:00:15Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL).
We show that when transformers are trained over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares.
arXiv Detail & Related papers (2023-06-16T15:50:03Z)
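For the linear-regression setting in the entry above, the baseline that the trained transformers' predictions are reported to mimic is ordinary least squares computed from the in-context examples alone. The snippet below is a minimal numpy illustration of that baseline under an assumed toy data model; it is not the paper's transformer or code.

```python
# Minimal illustration (assumed toy setup): the ordinary-least-squares predictor
# fit only to the in-context (x_i, y_i) pairs of a linear-regression prompt.
import numpy as np

rng = np.random.default_rng(1)
d, n_context = 5, 20
w_star = rng.normal(size=d)                          # task vector drawn per prompt
X = rng.normal(size=(n_context, d))                  # in-context inputs
y = X @ w_star + 0.1 * rng.normal(size=n_context)    # noisy in-context labels
x_query = rng.normal(size=d)                         # query input to be predicted

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)        # least-squares fit to the context
print("OLS in-context prediction:", x_query @ w_ols)
print("noiseless ground-truth label:", x_query @ w_star)
```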
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states.
We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs.
Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
arXiv Detail & Related papers (2023-01-23T15:18:54Z)