Two-Scale Latent Dynamics for Recurrent-Depth Transformers
- URL: http://arxiv.org/abs/2509.23314v1
- Date: Sat, 27 Sep 2025 14:01:40 GMT
- Title: Two-Scale Latent Dynamics for Recurrent-Depth Transformers
- Authors: Francesco Pappone, Donato Crisostomi, Emanuele Rodolà
- Abstract summary: We study the geometry of recurrent-depth transformers, which scale test-time compute by iterating latent computations before emitting tokens. Across checkpoints, our measurements show that loop steps become \emph{smaller} and increasingly \emph{orthogonal} to one another. These dynamics motivate an early-exit mechanism based on the model's second-order difference in step-size.
- Score: 18.852161704625562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recurrent-depth transformers scale test-time compute by iterating latent computations before emitting tokens. We study the geometry of these iterates and argue for a simple, \emph{two-scale} operational picture: (i) within a looped block, updates act as \emph{small-scale refinements}; (ii) across consecutive blocks, states undergo a \emph{larger-scale drift}. Across checkpoints, our measurements show that loop steps become \emph{smaller} and increasingly \emph{orthogonal} to one another, indicating better local modeling of fine structure rather than merely pushing in a single direction. These dynamics motivate an early-exit mechanism based on the model's second-order difference in step-size, which we show is superior in performance, stability, and time-efficiency to the KL-divergence exit strategy of Geiping et al. and its naive first-order counterpart.
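The geometric measurements and the early-exit rule described in the abstract can be pictured with a small amount of code. Below is a minimal sketch, assuming a generic looped update `block(h, x)` that refines a latent state; the names `block`, `max_steps`, and `exit_threshold`, and the exact exit condition, are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def loop_with_second_order_exit(block, h, x, max_steps=32, exit_threshold=1e-3):
    """Iterate a recurrent-depth block, tracking the step geometry described in
    the abstract, and exit early when the second-order difference of the step
    size flattens (illustrative criterion, not the paper's exact rule).

    block: callable (h, x) -> next latent state (the looped update)
    h:     initial latent state, shape (batch, dim)
    x:     token embedding / conditioning input passed to every iteration
    """
    step_norms, cosines = [], []   # ||h_{t+1} - h_t|| and cos(update_t, update_{t-1})
    prev_update = None

    for t in range(max_steps):
        h_next = block(h, x)
        update = h_next - h

        # Small-scale refinement size: the abstract reports these norms shrinking.
        step_norms.append(update.norm(dim=-1).mean().item())

        # Orthogonality between consecutive loop steps: reported to approach zero.
        if prev_update is not None:
            cosines.append(
                F.cosine_similarity(update, prev_update, dim=-1).mean().item()
            )

        # Second-order difference of the step size: d2_t = s_t - 2*s_{t-1} + s_{t-2}.
        if len(step_norms) >= 3:
            d2 = step_norms[-1] - 2 * step_norms[-2] + step_norms[-3]
            if abs(d2) < exit_threshold:
                return h_next, t + 1, step_norms, cosines  # curvature has flattened

        prev_update = update
        h = h_next

    return h, max_steps, step_norms, cosines
```

A first-order counterpart would exit once the step norm itself drops below a threshold; the abstract reports that conditioning on the second-order difference instead is more stable and time-efficient than both that baseline and the KL-divergence exit of Geiping et al.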
Related papers
- PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z) - Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds [0.4779196219827507]
We show how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based} routing law for attention scores. We show that this coupled specialization behaves like a two-timescale EM procedure.
arXiv Detail & Related papers (2025-12-27T05:31:44Z) - Block-Recurrent Dynamics in Vision Transformers [42.261020313952976]
We argue that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. We train a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost.
arXiv Detail & Related papers (2025-12-23T00:18:23Z) - TNT: Improving Chunkwise Training for Test-Time Memorization [62.78875147721906]
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. We introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. TNT achieves a substantial acceleration in training speed, up to 17 times faster than the most accurate baseline configuration.
arXiv Detail & Related papers (2025-11-10T17:45:09Z) - OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. The resulting discrete tokenization shortens the training sequence by 6.8$\times$ and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z) - Tracing the Representation Geometry of Language Models from Pretraining to Post-training [22.18942718274405]
We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training. We uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data.
arXiv Detail & Related papers (2025-09-27T00:46:29Z) - H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_2$OT). Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
arXiv Detail & Related papers (2025-09-08T17:59:59Z) - Phase transition of \emph{descending} phase retrieval algorithms [0.0]
We study theoretical limits of \emph{descending} phase retrieval algorithms. We identify the concepts of the \emph{parametric manifold} and its \emph{funneling points} as key mathematical objects.
arXiv Detail & Related papers (2025-06-23T04:10:35Z) - AutoHFormer: Efficient Hierarchical Autoregressive Transformer for Time Series Prediction [36.239648954658534]
Time series forecasting requires architectures that simultaneously achieve three competing objectives. We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges. Comprehensive experiments demonstrate that AutoHFormer achieves 10.76X faster training and 6.06X memory reduction compared to PatchTST on P08.
arXiv Detail & Related papers (2025-06-19T03:47:04Z) - Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight [1.4141453107129398]
We analyze the convergence rate using programming with Polyak's acceleration.
We show that the convergence rate can be written as exponential in step-size and momentum weight.
arXiv Detail & Related papers (2024-07-31T04:25:39Z) - Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism. We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning [13.908826484332282]
We study a new two-time-scale gradient method for solving optimization problems.
Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale gradient algorithm.
We apply our framework to gradient-based policy evaluation algorithms in reinforcement learning.
arXiv Detail & Related papers (2021-09-29T23:15:23Z) - Nesterov Accelerated ADMM for Fast Diffeomorphic Image Registration [63.15453821022452]
Recent developments in approaches based on deep learning have achieved sub-second runtimes for DiffIR.
We propose a simple iterative scheme that functionally composes intermediate non-stationary velocity fields.
We then propose a convex optimisation model that uses a regularisation term of arbitrary order to impose smoothness on these velocity fields.
arXiv Detail & Related papers (2021-09-26T19:56:45Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)