Related papers: Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium

URL: http://arxiv.org/abs/2511.21882v1
Date: Wed, 26 Nov 2025 20:02:59 GMT
Title: Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium
Authors: Akbar Anbar Jafari, Gholamreza Anbarjafari
Abstract summary: We introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium. We instantiate this principle as Equilibrium Transformers, which augment standard transformer layers with an Equilibrium Refinement Module. Preliminary experiments on the binary parity task demonstrate +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance.
Score: 0.6820746164515952
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Contemporary autoregressive transformers operate in open loop: each hidden state is computed in a single forward pass and never revised, causing errors to propagate uncorrected through the sequence. We identify this open-loop bottleneck as a fundamental architectural limitation underlying well-documented failures in long-range reasoning, factual consistency, and multi-step planning. To address this limitation, we introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium before committing to each token. We instantiate this principle as Equilibrium Transformers (EqT), which augment standard transformer layers with an Equilibrium Refinement Module that minimizes a learned energy function via gradient descent in latent space. The energy function enforces bidirectional prediction consistency, episodic memory coherence, and output confidence, all computed without external supervision. Theoretically, we prove that EqT performs approximate MAP inference in a latent energy-based model, establish linear convergence guarantees, and show that refinement improves predictions precisely on hard instances where one-shot inference is suboptimal. The framework unifies deep equilibrium models, diffusion language models, and test-time training as special cases. Preliminary experiments on the binary parity task demonstrate +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance, validating that the benefit of deliberation scales with task difficulty. Just as attention mechanisms resolved the sequential bottleneck of recurrent networks, we propose that closed-loop equilibrium may resolve the commitment bottleneck of open-loop autoregression, representing a foundational step toward language models.
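The refinement loop described in the abstract (gradient descent on an energy in latent space until the state stops changing) can be illustrated with a minimal sketch. This is not the authors' Equilibrium Refinement Module: the function `refine_latent`, the quadratic toy energy, the step size, and the stopping tolerance are all assumptions chosen for illustration, with the quadratic standing in for the learned energy function.

```python
import numpy as np

def refine_latent(h0, energy_grad, step_size=0.1, tol=1e-6, max_iters=500):
    """Run gradient descent on an energy over a latent vector.

    Stops when the size of the update falls below `tol`, which is used
    here as a simple proxy for having reached an equilibrium.
    Returns the refined latent and the number of iterations used.
    """
    h = np.asarray(h0, dtype=float).copy()
    for it in range(1, max_iters + 1):
        update = step_size * energy_grad(h)
        h = h - update
        if np.linalg.norm(update) < tol:
            return h, it
    return h, max_iters

# Hypothetical stand-in for a learned energy: E(h) = 0.5 * ||h - target||^2,
# whose unique minimizer (the "equilibrium") is `target`.
target = np.array([1.0, -2.0, 0.5])

def toy_energy_grad(h):
    return h - target

# Start from a "one-shot" latent of zeros and refine it.
h_star, n_iters = refine_latent(np.zeros(3), toy_energy_grad)
print(h_star, n_iters)
```

For this strongly convex toy energy, each step shrinks the distance to the minimizer by a constant factor of 0.9, which is the kind of linear convergence the abstract refers to. A learned energy would not in general be quadratic, so the behavior here should be read only as an idealized case.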

Related papers

Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs [55.77845440440496]
Push-based decentralized communication enables optimization over communication networks, where information exchange may be asymmetric. We develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm. A key technical ingredient is an imbalance-aware generalization bound through two quantities.
arXiv Detail & Related papers (2026-02-24T05:32:03Z)
Closing the Loop: A Control-Theoretic Framework for Provably Stable Time Series Forecasting with LLMs [22.486083545585984]
Large Language Models (LLMs) have recently shown exceptional potential in time series forecasting. Existing approaches typically employ a naive autoregressive generation strategy. We propose F-LLM, a novel closed-loop framework.
arXiv Detail & Related papers (2026-02-13T09:35:12Z)
Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives [22.29000001610794]
Standard negative log-likelihood for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. We introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the
arXiv Detail & Related papers (2026-02-11T22:56:43Z)
PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z)
Drift No More? Context Equilibria in Multi-Turn LLM Interactions [58.69551510148673]
Context drift is the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. We show that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay.
arXiv Detail & Related papers (2025-10-09T04:48:49Z)
Transformers Are Universally Consistent [14.904264782690639]
We show that Transformers equipped with softmax-based nonlinear attention are uniformly consistent when tasked with executing Least Squares regression. We derive upper bounds on the empirical error which, in the regime, decay at a provable rate of $\mathcal{O}(t^{-1/2d})$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality.
arXiv Detail & Related papers (2025-05-30T12:39:26Z)
Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation [8.973965016201822]
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to instability. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and gradients.
arXiv Detail & Related papers (2025-05-30T08:18:23Z)
Preconditioned Langevin Dynamics with Score-Based Generative Models for Infinite-Dimensional Linear Bayesian Inverse Problems [4.2223436389469144]
Langevin dynamics driven by score-based generative models (SGMs) acting as priors, formulated directly in function space. We derive, for the first time, error estimates that explicitly depend on the approximation error of the score. As a consequence, we obtain sufficient conditions for global convergence in Kullback-Leibler divergence on the underlying function space.
arXiv Detail & Related papers (2025-05-23T18:12:04Z)
Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning [16.35681450323654]
Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute. We give a theoretical justification as to why memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning. Our model sees consistent performance gains over vanilla Transformers and pause-token augmented baselines, with gains of up to +6.6pp for selected tasks/backbones.
arXiv Detail & Related papers (2025-05-22T17:33:49Z)
Scalable Equilibrium Sampling with Sequential Boltzmann Generators [60.00515282300297]
We extend the Boltzmann generator framework with two key contributions. The first is a highly efficient Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. In particular, we perform inference-time scaling of flow samples using a continuous-time variant of sequential Monte Carlo.
arXiv Detail & Related papers (2025-02-25T18:59:13Z)
On the Power of Perturbation under Sampling in Solving Extensive-Form Games [56.013335390600524]
We investigate how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in solving extensive-form games under sampling. We present a unified framework for Perturbed FTRL algorithms and study two variants: PFTRL-KL and PFTRL-RKL.
arXiv Detail & Related papers (2025-01-28T00:29:38Z)
Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z)
Regularized Vector Quantization for Tokenized Image Synthesis [126.96880843754066]
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective. This paper presents a regularized vector quantization framework that mitigates the above issues effectively by applying regularization from two perspectives.
arXiv Detail & Related papers (2023-03-11T15:20:54Z)
On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence. We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.