LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation
- URL: http://arxiv.org/abs/2602.11451v1
- Date: Wed, 11 Feb 2026 23:58:28 GMT
- Title: LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation
- Authors: Ahmadreza Jeddi, Marco Ciccone, Babak Taati
- Abstract summary: We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths. LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints.
- Score: 9.943277041891788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.
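The abstract leaves the architecture details open, but the mechanism it describes (a shared block applied repeatedly, with each application conditioned on the current loop time and step size) can be sketched directly. The PyTorch snippet below is a minimal, hypothetical illustration of such time- and step-size-conditioned looping; the module layout, dimensions, FiLM-style modulation, and the convention t = i/n, d = 1/n are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModulatedLoopBlock(nn.Module):
    """One shared Transformer block whose input is modulated by loop time t and step size d."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Maps (t, d) to a per-channel scale and shift for the hidden state (assumed conditioning).
        self.cond = nn.Linear(2, 2 * d_model)

    def forward(self, h: torch.Tensor, t: float, d: float) -> torch.Tensor:
        cond = torch.tensor([t, d], device=h.device, dtype=h.dtype)
        scale, shift = self.cond(cond).chunk(2, dim=-1)
        h_mod = h * (1 + scale) + shift              # shortcut-style modulation of the state
        x = self.norm1(h_mod)
        attn_out, _ = self.attn(x, x, x)
        h = h + attn_out
        h = h + self.mlp(self.norm2(h))
        return h


def looped_forward(block: ModulatedLoopBlock, h: torch.Tensor, n_loops: int) -> torch.Tensor:
    # t is the normalized position in the trajectory and d = 1/n_loops is the step size,
    # so trajectories of any length traverse the same [0, 1] interval.
    d = 1.0 / n_loops
    for i in range(n_loops):
        h = block(h, t=i * d, d=d)
    return h


if __name__ == "__main__":
    block = ModulatedLoopBlock()
    x = torch.randn(2, 16, 256)                        # (batch, seq, d_model)
    print(looped_forward(block, x, n_loops=4).shape)   # small compute budget
    print(looped_forward(block, x, n_loops=16).shape)  # larger compute budget
```

Under such a convention, a shortcut-consistency objective could, for instance, pull the outcome of one loop with step size 2d toward the outcome of two loops with step size d, so that short trajectories stay informative while longer ones keep refining; the exact consistency target is not specified in the abstract.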
Related papers
- Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training [9.617245548268437]
We propose an inference-time inner looping to prolong refinement in pretrained language models. Across multiple benchmarks, inner looping yields modest but consistent accuracy improvements. Overall, our results suggest that additional refinement can be obtained through simple test-time looping, extending computation in frozen pretrained models.
arXiv Detail & Related papers (2026-02-16T14:04:24Z)
- Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer [65.38883376379812]
We propose the Discrete Transformer, an architecture engineered to bridge the gap between continuous representations and discrete symbolic logic. Empirically, the Discrete Transformer not only achieves performance comparable to RNN-based baselines but crucially extends interpretability to continuous variable domains.
arXiv Detail & Related papers (2026-01-09T12:49:41Z)
- Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens. However, traditional token-level Speculative Decoding struggles on reasoning tasks because token mismatches in semantically equivalent steps cause unnecessary rejections. We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models.
arXiv Detail & Related papers (2025-12-04T17:50:53Z)
- A Formal Comparison Between Chain-of-Thought and Latent Thought [32.84174396586435]
Chain-of-Thought (CoT) elicits reasoning in large language models by explicitly generating intermediate steps in natural language. Latent Thought in looped models operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations.
arXiv Detail & Related papers (2025-09-25T11:27:52Z)
- To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers [32.84174396586435]
Chain-of-Thought (CoT) and Looped Transformers have been shown to empirically improve performance on reasoning tasks. We provide a formal analysis of their respective strengths and limitations.
arXiv Detail & Related papers (2025-05-25T17:49:37Z)
- Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners [72.37408197157453]
Recent advancements have demonstrated that the performance of large language models (LLMs) can be significantly enhanced by scaling computational resources at test time. This raises a fundamental question: can models with lower complexity leverage their superior generation throughput to outperform similarly sized Transformers for a fixed computational budget? To address this question and overcome the lack of strong subquadratic reasoners, we distill pure and hybrid Mamba models from pretrained Transformers.
arXiv Detail & Related papers (2025-02-27T18:08:16Z)
- Reasoning with Latent Thoughts: On the Power of Looped Transformers [52.84192961524481]
We show that for many synthetic reasoning problems, a $k$-layer transformer looped $L$ times nearly matches the performance of a $kL$-layer non-looped model. Our empirical analysis reveals an intriguing phenomenon: looped and non-looped models exhibit scaling behavior that depends on their effective depth.
arXiv Detail & Related papers (2025-02-24T18:49:05Z)
- Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [47.06427150903487]
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language models' reasoning capabilities. Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. We propose RELAY to better leverage the strengths of Looped Transformers.
arXiv Detail & Related papers (2025-02-12T15:17:04Z)
- Loop Neural Networks for Parameter Sharing [1.1049608786515839]
We introduce a novel Loop Neural Network, which achieves better performance by utilizing longer computational time without increasing the model size.
Our approach revisits the input multiple times, refining the prediction by iteratively looping over a subset of the model with residual connections.
We demonstrate the effectiveness of this method through experiments comparing versions of GPT-2 with our loop models, showing improved performance in language modeling tasks while maintaining similar parameter counts.
arXiv Detail & Related papers (2024-09-21T17:07:42Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute, with a potential speedup of up to $\times 3$, while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
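CALM's dynamic per-step compute allocation is closely related to LoopFormer's budget-conditioned looping. As a rough, hypothetical illustration of that family of ideas, the sketch below runs a stack of layers with confidence-based early exit, decoding intermediate states with the language-model head and skipping the remaining layers once a prediction is confident enough; the threshold, the softmax-max confidence measure, and the toy layers are assumptions rather than CALM's actual calibrated procedure.

```python
import torch
import torch.nn as nn

def early_exit_logits(layers: nn.ModuleList, lm_head: nn.Linear,
                      h: torch.Tensor, threshold: float = 0.9):
    """Return logits for one token state h and the number of layers actually used."""
    logits = lm_head(h)                        # fallback prediction from the input state
    for depth, layer in enumerate(layers, start=1):
        h = layer(h)
        logits = lm_head(h)
        confidence = torch.softmax(logits, dim=-1).max().item()
        if confidence >= threshold:            # confident enough: skip the remaining layers
            return logits, depth
    return logits, len(layers)

if __name__ == "__main__":
    d_model, vocab = 64, 100
    layers = nn.ModuleList(
        nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(8)
    )
    lm_head = nn.Linear(d_model, vocab)
    h = torch.randn(d_model)
    logits, used = early_exit_logits(layers, lm_head, h)
    print(used, logits.shape)
```

In a looped model the same idea applies with loops in place of layers, which is what makes elastic-depth schemes such as LoopFormer's attractive for budget-aware inference.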