In-Context Algorithm Emulation in Fixed-Weight Transformers
- URL: http://arxiv.org/abs/2508.17550v2
- Date: Fri, 26 Sep 2025 15:04:11 GMT
- Title: In-Context Algorithm Emulation in Fixed-Weight Transformers
- Authors: Jerry Yao-Chieh Hu, Hude Liu, Jennifer Yuntong Zhang, Han Liu
- Abstract summary: A minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. A single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class via only prompting.
- Score: 13.585357287532588
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. We formalize two modes of in-context algorithm emulation. In the task-specific mode, for any continuous function $f: \mathbb{R} \to \mathbb{R}$, we show the existence of a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision. This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression). In the prompt-programmable mode, we prove universality: a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via only prompting. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. Numerical results corroborate our theory. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
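A minimal sketch of this mechanism is given below. It is not the paper's exact construction: it assumes a NumPy single-head softmax attention with frozen (identity) projections, a prompt that discretizes the scalar $s = w^\top x - y$ on a grid with keys and values chosen so the query's dot products peak at the token encoding $s$, and illustrative choices of $f = \mathrm{ReLU}$ and inverse temperature beta. The attention output then approximates $f(w^\top x - y)$ by reading it off the prompt.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fixed_attention(Q, K, V, beta=500.0):
    """One softmax attention head with frozen (identity) projections.
    A large beta mimics the sharp dot-product gap: the softmax becomes
    nearly one-hot, so the output essentially copies one value row."""
    return softmax(beta * Q @ K.T) @ V

# Hypothetical prompt encoding f(w^T x - y) for f = ReLU.
# Prompt tokens discretize the scalar s = w^T x - y on a grid;
# key(s_i) = (s_i, -s_i^2 / 2), value(s_i) = f(s_i).
f = lambda s: np.maximum(s, 0.0)
grid = np.linspace(-3.0, 3.0, 601)
K = np.stack([grid, -0.5 * grid**2], axis=1)
V = f(grid)[:, None]

w, x, y = np.array([0.7, -1.2]), np.array([1.5, 0.4]), 0.3
s = w @ x - y
# Query (s, 1): <query, key_i> = s*s_i - s_i^2/2, maximized exactly at
# s_i = s, so attention concentrates on the prompt token nearest to s.
Q = np.array([[s, 1.0]])

print(fixed_attention(Q, K, V).item(), "vs", f(s))  # ~0.27 vs 0.27
```

Increasing beta or refining the grid sharpens the dot-product gap and tightens the approximation, which is the intuition behind the prompt-programmable mode: swapping the prompt (here, w, y, and the encoded f) changes the emulated computation without touching any weights.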
Related papers
- Replicable Constrained Bandits [27.71248958698115]
A replicable online learning algorithm is one that takes the same sequence of decisions across different executions in the same environment, with high probability. We design replicable algorithms whose regret and constraint violation match those of non-replicable ones in terms of $T$. As a key step toward these guarantees, we develop the first replicable UCB-like algorithm for unconstrained MABs.
arXiv Detail & Related papers (2026-02-16T09:22:23Z) - Finding Clustering Algorithms in the Transformer Architecture [16.336124248778496]
We show that transformers can implement a fundamental and widely used algorithm for $k$-means clustering: Lloyd's algorithm. We numerically implement this transformer and demonstrate in experiments the exact correspondence between our architecture and Lloyd's algorithm. Our findings offer a clear and interpretable perspective on implementing precise algorithms in transformers.
arXiv Detail & Related papers (2025-06-23T20:52:01Z) - Transformers Meet In-Context Learning: A Universal Approximation Theory [25.513848079509653]
We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions, we demonstrate how to construct a transformer that can predict based on a few noisy in-context examples.
arXiv Detail & Related papers (2025-06-05T16:12:51Z) - Towards Learning High-Precision Least Squares Algorithms with Sequence Models [42.217390215093516]
We show that softmax Transformers struggle to perform high-precision multiplications. We identify limitations present in existing architectures and training procedures. For the first time, we demonstrate the ability to train to near machine precision.
arXiv Detail & Related papers (2025-03-15T23:25:11Z) - Algorithmic Language Models with Neurally Compiled Libraries [16.284360949127723]
Large Language Models lack true algorithmic ability. Our paper proposes augmenting LLMs with a library of fundamental operations and sophisticated differentiable programs. We explore the feasibility of augmenting LLaMA3 with a differentiable computer.
arXiv Detail & Related papers (2024-07-06T00:27:05Z) - Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - Accelerated First-Order Optimization under Nonlinear Constraints [61.98523595657983]
We exploit connections between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms. An important property of these algorithms is that constraints are expressed in terms of velocities rather than positions.
arXiv Detail & Related papers (2023-02-01T08:50:48Z) - Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer (SOFT) is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z) - Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player to play against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress, with no limit cycles, and to terminate after an appropriate number of gradient ascent steps.
arXiv Detail & Related papers (2021-06-02T22:03:36Z) - Metalearning: Sparse Variable-Structure Automata [0.0]
We propose a metalearning approach to increase the number of basis vectors used in dynamic sparse coding on the fly.
An actor-critic algorithm is deployed to automatically choose an appropriate feature dimension with respect to the required level of accuracy.
arXiv Detail & Related papers (2021-01-30T21:32:23Z) - Activation Relaxation: A Local Dynamical Approximation to Backpropagation in the Brain [62.997667081978825]
Activation Relaxation (AR) is motivated by constructing the backpropagation gradient as the equilibrium point of a dynamical system.
Our algorithm converges rapidly and robustly to the correct backpropagation gradients, requires only a single type of computational unit, and can operate on arbitrary computation graphs.
arXiv Detail & Related papers (2020-09-11T11:56:34Z) - FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
arXiv Detail & Related papers (2020-05-22T23:07:42Z)