Block-Recurrent Dynamics in Vision Transformers
- URL: http://arxiv.org/abs/2512.19941v1
- Date: Tue, 23 Dec 2025 00:18:23 GMT
- Title: Block-Recurrent Dynamics in Vision Transformers
- Authors: Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller,
- Abstract summary: We argue that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. We train a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost.
- Score: 42.261020313952976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest a few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale experiments, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where the cls token executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low-rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find that a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
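A minimal sketch of the block-recurrent idea described in the abstract, written in PyTorch under assumed details: the `ViTBlock` layout, the `BlockRecurrentSurrogate` class, and the even repeat schedule are illustrative stand-ins, not the authors' Raptor implementation.

```python
# Illustrative sketch only: emulate a depth-L ViT with k << L distinct
# blocks, each applied recurrently so total depth (and compute) matches.
# Class names, block layout, and the repeat schedule are assumptions.
import torch
import torch.nn as nn


class ViTBlock(nn.Module):
    """Standard pre-norm Transformer block (self-attention + MLP)."""

    def __init__(self, dim: int, heads: int = 6, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class BlockRecurrentSurrogate(nn.Module):
    """k distinct blocks, each unrolled L/k times: the 'phases' of the BRH."""

    def __init__(self, dim: int, k: int = 2, total_depth: int = 12):
        super().__init__()
        assert total_depth % k == 0
        self.blocks = nn.ModuleList(ViTBlock(dim) for _ in range(k))
        self.steps = total_depth // k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:            # k phases along depth
            for _ in range(self.steps):      # recurrent reuse within a phase
                x = block(x)
        return x


tokens = torch.randn(2, 197, 384)            # (batch, tokens, dim), ViT-S-like
out = BlockRecurrentSurrogate(dim=384, k=2)(tokens)
print(out.shape)                             # torch.Size([2, 197, 384])
```

Per the abstract, the surrogate is trained to approximate a pretrained teacher; here the weights are random and only the depth structure is shown.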
Related papers
- PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
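The Rank-$L$ claim can be illustrated with a generic linear-algebra check rather than PRISM's actual solver: a single outer-product update is rank-1, while accumulating $L$ of them can span up to rank $L$.

```python
# Generic illustration of rank-1 vs. accumulated rank-L updates; this is
# only the linear-algebra fact behind the summary, not PRISM's algorithm.
import torch

d, L = 64, 8
single = torch.randn(d, 1) @ torch.randn(1, d)                       # one step
accum = sum(torch.randn(d, 1) @ torch.randn(1, d) for _ in range(L))

print(torch.linalg.matrix_rank(single).item())   # 1
print(torch.linalg.matrix_rank(accum).item())    # 8 (almost surely)
```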
arXiv Detail & Related papers (2026-02-11T12:39:41Z) - Distributed physics-informed neural networks via domain decomposition for fast flow reconstruction [8.614942690565782]
PINNs offer a powerful paradigm for flow reconstruction, seamlessly integrating velocity measurements with the governing Navier-Stokes equations to recover complete velocity and latent pressure fields. A critical challenge in such distributed PINNs is pressure indeterminacy, where independent sub-networks drift into inconsistent local pressure baselines. By enforcing a unidirectional flow from designated master ranks, our approach eliminates this non-uniqueness and guarantees a globally consistent pressure field while preserving temporal continuity.
arXiv Detail & Related papers (2026-02-05T16:41:55Z) - Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds [0.5729426778193398]
We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable "concept basins" to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space.
arXiv Detail & Related papers (2026-01-16T23:11:02Z) - Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core [4.073707521515039]
"Digital metabolism" is a hypothesis suggesting that targeted forgetting is necessary for distilling a pure neural logic core.<n>We introduce the Regenerative Logic-Core Protocol (RLCP), a dual-stream training framework that renders specific factual dependencies linearly undecodable.<n> Empirical analysis on GSM8K reveals that the "metabolized" model spontaneously adopts Symbolic chain-of-thought scaffolding.
arXiv Detail & Related papers (2026-01-15T19:21:16Z) - Deep Delta Learning [91.75868893250662]
We introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection. We provide a spectral analysis of this operator, demonstrating that the input-dependent gate enables dynamic interpolation between identity mapping, projection, and geometric reflection. This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics.
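The identity/projection/reflection claim has a simple spectral reading that can be checked numerically; the rank-1 gated operator $T(g) = I - g\,vv^\top$ below is an assumed toy form, not the DDL layer itself.

```python
# Toy check of the spectral claim: a gated rank-1 transition
# T(g) = I - g * v v^T is the identity at g=0, an orthogonal projection at
# g=1, and a Householder reflection at g=2. Not the DDL architecture.
import torch

d = 16
v = torch.randn(d, 1)
v = v / v.norm()

def transition(g: float) -> torch.Tensor:
    return torch.eye(d) - g * (v @ v.T)

for g in (0.0, 1.0, 2.0):
    smallest = torch.linalg.eigvalsh(transition(g)).min().item()
    print(f"g={g}: smallest eigenvalue ~ {smallest:.2f}")   # 1.0, 0.0, -1.0
```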
arXiv Detail & Related papers (2026-01-01T18:11:38Z) - Rethinking Vision Transformer Depth via Structural Reparameterization [16.12815682992294]
We propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K.
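The consolidation step rests on a standard reparameterization identity, sketched below for two summed linear branches; the paper's actual branch design inside transformer blocks may differ.

```python
# Standard reparameterization identity: parallel linear branches whose
# outputs are summed collapse into one linear layer at inference time.
import torch
import torch.nn as nn

dim = 32
branch_a, branch_b = nn.Linear(dim, dim), nn.Linear(dim, dim)

merged = nn.Linear(dim, dim)
with torch.no_grad():
    merged.weight.copy_(branch_a.weight + branch_b.weight)
    merged.bias.copy_(branch_a.bias + branch_b.bias)

x = torch.randn(4, dim)
assert torch.allclose(branch_a(x) + branch_b(x), merged(x), atol=1e-5)
```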
arXiv Detail & Related papers (2025-11-24T21:28:55Z) - Learning by Steering the Neural Dynamics: A Statistical Mechanics Perspective [0.0]
We study how neural dynamics can support fully local, distributed learning. We propose a biologically plausible algorithm for supervised learning with any binary recurrent network.
arXiv Detail & Related papers (2025-10-13T22:28:34Z) - Drift No More? Context Equilibria in Multi-Turn LLM Interactions [58.69551510148673]
Context drift is the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. We show that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay.
arXiv Detail & Related papers (2025-10-09T04:48:49Z) - Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction. We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
arXiv Detail & Related papers (2025-09-30T03:56:44Z) - Two-Scale Latent Dynamics for Recurrent-Depth Transformers [18.852161704625562]
We study the geometry of recurrent-depth transformers, which scale test-time compute by iterating latent computations before emitting tokens. Across checkpoints, our measurements show that loop steps become smaller and increasingly orthogonal to one another. These dynamics motivate an early-exit mechanism based on the second-order difference in the model's step size.
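A rough sketch of the diagnostics this summary describes: step sizes, angles between consecutive latent updates, and an early-exit test on the second-order difference of step size. The function name, threshold, and toy latent loop are assumptions, not the paper's procedure.

```python
# Sketch of loop-step diagnostics and a second-order-difference early exit.
# The threshold, names, and toy latent loop are illustrative assumptions.
import torch
import torch.nn.functional as F

def loop_diagnostics(states, tol: float = 1e-2):
    steps = [b - a for a, b in zip(states[:-1], states[1:])]
    norms = [s.norm().item() for s in steps]
    cosines = [F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()
               for a, b in zip(steps[:-1], steps[1:])]
    # Exit once the discrete second derivative of step size is negligible.
    exit_at = next((t + 2 for t in range(len(norms) - 2)
                    if abs(norms[t + 2] - 2 * norms[t + 1] + norms[t]) < tol),
                   None)
    return norms, cosines, exit_at

z = torch.randn(128)
states = [z]
for _ in range(12):                 # toy contracting latent loop
    z = 0.5 * z
    states.append(z)
print(loop_diagnostics(states)[2])  # first loop index passing the exit test
```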
arXiv Detail & Related papers (2025-09-27T14:01:40Z) - Tracing the Representation Geometry of Language Models from Pretraining to Post-training [22.18942718274405]
We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training. We uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data.
arXiv Detail & Related papers (2025-09-27T00:46:29Z) - Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities [58.742178800799614]
We study a family of retrieval and copying tasks inspired by Liu et al. We observe an induction-versus-anti-induction asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
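For concreteness, the two probe tasks named in the summary can be phrased as a one-line lookup; the prompt format below is assumed, not the paper's exact setup.

```python
# Toy version of the probes: return the right neighbour of a query token
# (induction) or its left neighbour (anti-induction). Format is assumed.
def retrieval_target(sequence: list[str], query: str, direction: str) -> str:
    i = sequence.index(query)
    return sequence[i + 1] if direction == "induction" else sequence[i - 1]

seq = ["A", "B", "C", "D", "E"]
print(retrieval_target(seq, "C", "induction"))       # D (token to the right)
print(retrieval_target(seq, "C", "anti-induction"))  # B (token to the left)
```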
arXiv Detail & Related papers (2025-05-27T21:36:50Z) - Intensity Profile Projection: A Framework for Continuous-Time
Representation Learning for Dynamic Networks [50.2033914945157]
We present a representation learning framework, Intensity Profile Projection, for continuous-time dynamic network data.
The framework proceeds in three stages, including estimating pairwise intensity functions and learning a projection which minimises a notion of intensity reconstruction error.
Moreover, we develop estimation theory providing tight control on the error of any estimated trajectory, indicating that the representations could even be used in quite noise-sensitive follow-on analyses.
arXiv Detail & Related papers (2023-06-09T15:38:25Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criterion comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
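As a rough picture of a Hessian-aware criterion, the sketch below scores prunable channels with a textbook OBD-style, empirical-Fisher estimate; it is not NViT's exact criterion and omits the latency-aware regularizer.

```python
# Generic second-order saliency sketch (OBD-style, empirical-Fisher diagonal):
# saliency_i ~ 0.5 * H_ii * w_i^2, summed per output channel. Not NViT's
# exact criterion; the latency-aware term is omitted.
import torch

def channel_saliency(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    hess_diag = grad.pow(2)                       # Fisher proxy for H_ii
    per_weight = 0.5 * hess_diag * weight.pow(2)
    return per_weight.sum(dim=tuple(range(1, weight.dim())))  # per out-channel

w = torch.randn(8, 16)                  # e.g. one projection matrix
g = torch.randn_like(w)                 # gradient from a calibration batch
print(channel_saliency(w, g))           # one comparable score per channel
```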
arXiv Detail & Related papers (2021-10-10T18:04:59Z)