Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training
- URL: http://arxiv.org/abs/2602.14759v1
- Date: Mon, 16 Feb 2026 14:04:24 GMT
- Title: Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training
- Authors: Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene
- Abstract summary: We propose inference-time inner looping to prolong refinement in pretrained language models. Across multiple benchmarks, inner looping yields modest but consistent accuracy improvements. Overall, our results suggest that additional refinement can be obtained through simple test-time looping, extending computation in frozen pretrained models.
- Score: 9.617245548268437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning architectures, and in particular Transformers, are conventionally viewed as a composition of layers. These layers are often obtained as the sum of two contributions: a residual path that copies the input, and the output of a Transformer block. As a consequence, the inner representations (i.e., the inputs of these blocks) can be interpreted as the iterative refinement of a propagated latent representation. Under this lens, many works suggest that the inner space is shared across layers, meaning that tokens can be decoded at early stages. Mechanistic interpretability goes even further by conjecturing that some layers act as refinement layers. Following this path, we propose inference-time inner looping, which prolongs refinement in pretrained off-the-shelf language models by repeatedly re-applying a selected block range. Across multiple benchmarks, inner looping yields modest but consistent accuracy improvements. Analyses of the resulting latent trajectories suggest more stable state evolution and continued semantic refinement. Overall, our results suggest that additional refinement can be obtained through simple test-time looping, extending computation in frozen pretrained models.
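To make the procedure concrete, here is a minimal sketch of inference-time inner looping under the residual-stream view above, assuming a pre-norm residual block. `Block`, `forward_with_inner_loop`, and the `loop_start`/`loop_end`/`n_loops` parameters are illustrative stand-ins rather than the authors' code; with `n_loops=1` the function reduces to the standard forward pass.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a frozen pretrained Transformer block (pre-norm residual MLP)."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        # Residual update: the block refines the propagated latent representation.
        return x + self.mlp(self.norm(x))

@torch.no_grad()
def forward_with_inner_loop(blocks, x, loop_start, loop_end, n_loops=2):
    """Re-apply blocks[loop_start:loop_end] n_loops times; all weights stay frozen."""
    for block in blocks[:loop_start]:            # prefix: normal forward pass
        x = block(x)
    for _ in range(n_loops):                     # inner loop: extra refinement iterations
        for block in blocks[loop_start:loop_end]:
            x = block(x)
    for block in blocks[loop_end:]:              # suffix: normal forward pass
        x = block(x)
    return x

blocks = nn.ModuleList(Block(64) for _ in range(8)).eval()
h = torch.randn(1, 16, 64)                       # (batch, tokens, d_model)
out = forward_with_inner_loop(blocks, h, loop_start=3, loop_end=6, n_loops=2)
print(out.shape)                                 # torch.Size([1, 16, 64])
```

Because the looped blocks are residual maps on a shared latent space, their output lives in the same space as their input, which is what makes re-application well-defined without any retraining.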
Related papers
- LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation [9.943277041891788]
We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths. LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints.
arXiv Detail & Related papers (2026-02-11T23:58:28Z)
- Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer [65.38883376379812]
We propose the Discrete Transformer, an architecture engineered to bridge the gap between continuous representations and discrete symbolic logic. Empirically, the Discrete Transformer not only achieves performance comparable to RNN-based baselines but crucially extends interpretability to continuous variable domains.
arXiv Detail & Related papers (2026-01-09T12:49:41Z)
- Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer [0.8738725605667471]
Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. In standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. We investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count.
arXiv Detail & Related papers (2025-07-02T23:35:21Z)
- Intra-Layer Recurrence in Transformers for Language Modeling [0.03320194947871346]
Intra-Layer Recurrence (ILR) is a more targeted approach that applies recurrence selectively to individual layers within a single forward pass (see the sketch after this list). Our experiments show that allocating more iterations to earlier layers yields optimal results.
arXiv Detail & Related papers (2025-05-03T16:16:55Z)
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of gradient flow on the regression loss despite the non-convexity of the landscape.
This is the first theoretical analysis for multi-layer Transformers in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Recurrent Generic Contour-based Instance Segmentation with Progressive Learning [111.31166268300817]
We propose a novel deep network architecture, i.e., PolySnake, for generic contour-based instance segmentation.
Motivated by the classic Snake algorithm, the proposed PolySnake achieves superior and robust segmentation performance.
arXiv Detail & Related papers (2023-01-21T05:34:29Z)
- Object Representations as Fixed Points: Training Iterative Refinement Algorithms with Implicit Differentiation [88.14365009076907]
Iterative refinement is a useful paradigm for representation learning.
We develop an implicit differentiation approach that improves the stability and tractability of training.
arXiv Detail & Related papers (2022-07-02T10:00:35Z)
- Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning [85.95599675484341]
Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations.
Transformers have little inductive bias towards learning temporally compressed representations.
arXiv Detail & Related papers (2022-05-30T00:12:33Z)
- Phase Collapse in Neural Networks [1.8620637029128544]
Deep convolutional image classifiers progressively transform the spatial variability into a smaller number of channels, which linearly separates all classes.
This paper demonstrates that a different mechanism, phase collapse, explains the ability to progressively eliminate spatial variability.
This is justified by showing how iterated phase collapses progressively improve the separation of class means, as opposed to thresholding non-linearities.
arXiv Detail & Related papers (2021-10-11T13:58:01Z)
- Iterative Decoding for Compositional Generalization in Transformers [5.269770493488338]
In sequence-to-sequence learning, transformers are often unable to predict correct outputs for even marginally longer examples.
This paper introduces iterative decoding, an alternative to seq2seq learning.
We show that transformers trained via iterative decoding outperform their seq2seq counterparts on the PCFG dataset.
arXiv Detail & Related papers (2021-10-08T14:52:25Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
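As referenced in the Intra-Layer Recurrence entry above, here is a minimal sketch of that per-layer variant: instead of looping one contiguous block range, each layer gets its own iteration count within a single forward pass. It reuses the `Block` stand-in, `blocks`, and `h` from the sketch under the abstract; the `reuse_counts` allocation is hypothetical, front-loading iterations onto earlier layers as the ILR summary reports works best.

```python
import torch

# Assumes `blocks` (nn.ModuleList of Block) and `h` from the sketch above.
@torch.no_grad()
def forward_with_intra_layer_recurrence(blocks, x, reuse_counts):
    """Run blocks[i] reuse_counts[i] times within a single forward pass."""
    assert len(reuse_counts) == len(blocks)
    for block, n in zip(blocks, reuse_counts):
        for _ in range(n):
            x = block(x)  # each extra pass is one more residual refinement step
    return x

# Hypothetical allocation: more iterations on early layers, one pass later on.
out = forward_with_intra_layer_recurrence(blocks, h, reuse_counts=[3, 3, 2, 2, 1, 1, 1, 1])
print(out.shape)  # torch.Size([1, 16, 64])
```

Setting every count to 1 recovers the standard forward pass, so the same frozen model serves as its own baseline.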