Manifold Trajectories in Next-Token Prediction: From Replicator Dynamics to Softmax Equilibrium
- URL: http://arxiv.org/abs/2508.21186v1
- Date: Thu, 28 Aug 2025 20:00:22 GMT
- Title: Manifold Trajectories in Next-Token Prediction: From Replicator Dynamics to Softmax Equilibrium
- Authors: Christopher R. Lee-Jenkins
- Abstract summary: Decoding in large language models is often described as scoring tokens and normalizing with softmax. We give a self-contained account of this step as a constrained variational principle on the probability simplex. We prove that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory inside the simplex and converges to the softmax equilibrium.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decoding in large language models is often described as scoring tokens and normalizing with softmax. We give a minimal, self-contained account of this step as a constrained variational principle on the probability simplex. The discrete, normalization-respecting ascent is the classical multiplicative-weights (entropic mirror) update; its continuous-time limit is the replicator flow. From these ingredients we prove that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory inside the simplex and converges to the softmax equilibrium. This formalizes the common "manifold traversal" intuition at the output-distribution level. The analysis yields precise, practice-facing consequences: temperature acts as an exact rescaling of time along the same trajectory, while top-k and nucleus sampling restrict the flow to a face with identical guarantees. We also outline a controlled account of path-dependent score adjustments and their connection to loop-like, hallucination-style behavior. We make no claims about training dynamics or internal representations; those are deferred to future work.
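A minimal numerical sketch of the paper's decoding picture (my construction from the abstract, not the authors' code): the multiplicative-weights / entropic mirror ascent on F(p) = <s, p> + T*H(p) stays inside the simplex and converges to softmax(s/T). The scores, step size eta, and step count below are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mirror_ascent_to_softmax(scores, T=1.0, eta=0.1, steps=500):
    """Multiplicative-weights (entropic mirror) ascent on
    F(p) = <scores, p> + T * H(p) over the probability simplex.
    Each step multiplies p by exp(eta * grad F) and renormalizes;
    the unique fixed point is softmax(scores / T)."""
    p = np.full(len(scores), 1.0 / len(scores))   # start at the barycenter
    for _ in range(steps):
        grad = scores - T * (np.log(p) + 1.0)     # gradient of F at p
        p *= np.exp(eta * grad)
        p /= p.sum()                              # renormalize: stay on the simplex
    return p

scores = np.array([2.0, 1.0, 0.5, -1.0])
T = 0.7
p_star = mirror_ascent_to_softmax(scores, T=T)
print(np.allclose(p_star, softmax(scores / T), atol=1e-8))  # True
```

Consistent with the abstract's face-restriction claim, top-k or nucleus truncation amounts to running the same ascent on the face spanned by the kept tokens (drop the masked coordinates before iterating; the multiplicative update never revives zeroed mass).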
Related papers
- Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting [6.369253528507392]
Iterative self-training repeatedly refits a model on pseudo-labels generated by its own predictions. We derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations.
arXiv Detail & Related papers (2026-02-15T07:28:12Z)
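A generic sketch of the iterative self-training loop this entry analyzes (illustrative scikit-learn code; it assumes nothing about the paper's actual model or data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, iters=5):
    """Generic iterative self-training: refit on pseudo-labels produced
    by the model's own predictions. Per the entry's title, this can
    denoise early on and forget signal in later iterations."""
    model = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(iters):
        pseudo = model.predict(X_unlab)             # pseudo-labels
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, pseudo])
        model = LogisticRegression().fit(X, y)      # refit on all data
    return model

# Toy data: two Gaussian blobs, few labels, many unlabeled points.
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(-2, 1, (5, 2)), rng.normal(2, 1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
print(self_train(X_lab, y_lab, X_unlab).score(X_lab, y_lab))
```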
- Clustering in Deep Stochastic Transformers [10.988655177671255]
Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point. We analyze deep Transformers where the noise arises from random value matrices. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension.
arXiv Detail & Related papers (2026-01-29T16:28:13Z)
- Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds [0.4779196219827507]
We show how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an advantage-based routing law for attention scores. We show that this coupled specialization behaves like a two-timescale EM procedure.
arXiv Detail & Related papers (2025-12-27T05:31:44Z)
- Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium [0.6820746164515952]
We introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium. We instantiate this principle as Equilibrium Transformers, which augment standard transformer layers with an Equilibrium Refinement Module. Preliminary experiments on the binary parity task demonstrate a +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance.
arXiv Detail & Related papers (2025-11-26T20:02:59Z)
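A rough sketch of the closed-loop idea as a damped fixed-point iteration (generic; the snippet does not describe the Equilibrium Refinement Module's architecture, so `f`, `damping`, and the toy update map below are assumptions):

```python
import numpy as np

def refine_to_equilibrium(f, z0, damping=0.5, tol=1e-6, max_iter=100):
    """Generic sketch: iterate z <- (1 - a) z + a f(z) until the latent
    is (approximately) a fixed point of the update map f."""
    z = z0
    for _ in range(max_iter):
        z_next = (1 - damping) * z + damping * f(z)
        if np.linalg.norm(z_next - z) < tol:
            return z_next          # self-consistent equilibrium reached
        z = z_next
    return z

# Toy update map: a contraction, so the iteration provably converges.
f = lambda z: 0.5 * np.tanh(z) + 0.1
print(refine_to_equilibrium(f, np.zeros(4)))
```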
- On the flow matching interpretability [2.816392009888047]
We propose a framework constraining each flow step to be sampled from a known physical distribution. Flow trajectories are mapped to (and constrained to traverse) the equilibrium states of the simulated physical process. This demonstrates that embedding physical semantics into generative flows transforms neural trajectories into interpretable physical processes.
arXiv Detail & Related papers (2025-10-24T07:26:45Z)
- Drift No More? Context Equilibria in Multi-Turn LLM Interactions [58.69551510148673]
Context drift is the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. We show that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay.
arXiv Detail & Related papers (2025-10-09T04:48:49Z)
- Quantum Rabi oscillations in the semiclassical limit: backreaction on the cavity field and entanglement [89.99666725996975]
We show that, for strong atom-field coupling, when the duration of the $\pi$ pulse is below $100\,\omega^{-1}$, the behaviour of the atomic excitation probability deviates significantly from the textbook result. In the rest of this work we study numerically the backreaction of the qubit on the cavity field and the resulting atom-field entanglement.
arXiv Detail & Related papers (2025-04-12T23:24:59Z)
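For orientation, the textbook behaviour the abstract contrasts with is presumably the resonant Rabi oscillation formula; a standard statement (my gloss, not taken from the paper):

```latex
% Resonant Rabi oscillation of the atomic excitation probability:
% a \pi pulse inverts the qubit after time t_\pi = \pi / \Omega.
\[
  P_e(t) = \sin^2\!\left(\frac{\Omega t}{2}\right),
  \qquad
  t_{\pi} = \frac{\pi}{\Omega}.
\]
```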
- Improving Consistency Models with Generator-Augmented Flows [16.049476783301724]
Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network. They can be learned in two ways: consistency distillation and consistency training. We propose a novel flow that transports noisy data towards their corresponding outputs derived from a consistency model.
arXiv Detail & Related papers (2024-06-13T20:22:38Z)
- Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both approaches: motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z)
- Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram Iteration [122.51142131506639]
We introduce a precise, fast, and differentiable upper bound for the spectral norm of convolutional layers using circulant matrix theory.
We show through a comprehensive set of experiments that our approach outperforms other state-of-the-art methods in terms of precision, computational cost, and scalability.
It proves highly effective for the Lipschitz regularization of convolutional neural networks, with competitive results against concurrent approaches.
arXiv Detail & Related papers (2023-05-25T15:32:21Z)
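A dense-matrix sketch of the Gram iteration idea behind this bound (the paper treats convolutional layers via circulant matrix theory; `n_iter` and the random test matrix here are illustrative):

```python
import numpy as np

def gram_spectral_norm_bound(W, n_iter=8):
    """Upper-bound the spectral norm sigma_1(W) by Gram iteration:
    G <- G G^T squares every singular value, so after k steps
    sigma_1(W) <= ||G_k||_F ** (2 ** -k). Rescaling by the Frobenius
    norm each step (tracked in log-space) keeps the iterates finite."""
    G = np.asarray(W, dtype=np.float64)
    log_scale = 0.0
    for _ in range(n_iter):
        n = np.linalg.norm(G)                       # Frobenius norm
        log_scale = 2.0 * (log_scale + np.log(n))   # account for the rescaling
        G = (G / n) @ (G / n).T
    return np.exp((np.log(np.linalg.norm(G)) + log_scale) / 2 ** n_iter)

# Quick check on a random matrix: the bound upper-bounds and
# rapidly approaches the true spectral norm.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
print(gram_spectral_norm_bound(W), np.linalg.svd(W, compute_uv=False)[0])
```

Each squaring makes the Frobenius norm increasingly dominated by the largest singular value, which is why a handful of iterations already gives a tight, differentiable bound.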
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer, SOFT, is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
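A generic softmax-free linear-attention sketch illustrating why dropping softmax permits linear complexity (this uses a simple positive feature map, not SOFT's Gaussian-kernel Nystrom approximation):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Softmax-free attention sketch: with a positive feature map phi,
    softmax(Q K^T) V is replaced by phi(Q) (phi(K)^T V), evaluated
    right-to-left so no n x n attention matrix is ever formed."""
    phi = lambda X: np.maximum(X, 0.0) + 1e-6   # illustrative feature map
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                     # (d, d_v) summary of keys and values
    Z = Qf @ Kf.sum(axis=0)           # per-query normalizer, shape (n,)
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d, dv = 16, 8, 8
out = linear_attention(rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)),
                       rng.standard_normal((n, dv)))
print(out.shape)  # (16, 8)
```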
- Distribution of Kinks in an Ising Ferromagnet After Annealing and the Generalized Kibble-Zurek Mechanism [0.8258451067861933]
We consider the annealing dynamics of a one-dimensional Ising ferromagnet induced by a temperature quench in finite time.
The universal power-law scaling of cumulants is corroborated by numerical simulations based on Glauber dynamics.
We consider linear, nonlinear, and exponential cooling schedules, among which the exponential schedule provides the most efficient shortcut to cooling in a given time.
arXiv Detail & Related papers (2021-05-19T13:58:33Z)
- Decoherent Quench Dynamics across Quantum Phase Transitions [0.0]
We formulate decoherent dynamics induced by continuous quantum non-demolition measurements of the instantaneous Hamiltonian.
We generalize the well-studied universal Kibble-Zurek behavior for linear temporal drive across the critical point.
We show that the freeze-out time scale can be probed from the relaxation of the Hall conductivity.
arXiv Detail & Related papers (2021-03-14T23:43:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.