From Growing to Looping: A Unified View of Iterative Computation in LLMs
- URL: http://arxiv.org/abs/2602.16490v1
- Date: Wed, 18 Feb 2026 14:25:16 GMT
- Title: From Growing to Looping: A Unified View of Iterative Computation in LLMs
- Authors: Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer,
- Abstract summary: Looped and depth-grown models exhibit convergent depth-wise signatures. Applying inference-time looping to the middle blocks of a depth-grown model improves accuracy by up to $2\times$. Depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy mixtures.
- Score: 42.46854018848624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to $2\times$, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.
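To make the abstract's key intervention concrete, here is a minimal, hypothetical sketch of inference-time looping over a middle block of layers. It is not the authors' code: `Block`, `forward_with_looping`, and the chosen layer indices are illustrative stand-ins for a real transformer's layer stack.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for one transformer layer: a residual MLP block."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, h):
        return h + self.mlp(self.norm(h))

def forward_with_looping(layers, h, start, end, n_loops):
    """Run the stack, repeating layers[start:end] n_loops times at inference."""
    for layer in layers[:start]:
        h = layer(h)
    for _ in range(n_loops):              # reuse the middle block's weights
        for layer in layers[start:end]:
            h = layer(h)
    for layer in layers[end:]:
        h = layer(h)
    return h

layers = nn.ModuleList([Block(64) for _ in range(8)])
h = torch.randn(2, 16, 64)                # (batch, sequence, hidden)
out = forward_with_looping(layers, h, start=3, end=5, n_loops=3)
print(out.shape)                          # torch.Size([2, 16, 64])
```

With `n_loops=1` this reduces to the ordinary forward pass; larger values spend extra compute without adding parameters, which is the sense in which looping and depth growth scale iterative computation.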
Related papers
- Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis [40.72065859626204]
We show that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half. This work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits. (A sketch of this kind of depth-utilization probe follows this entry.)
arXiv Detail & Related papers (2025-12-09T17:12:04Z)
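The depth-utilization analyses in the entry above and in "Do Language Models Use Their Depth Efficiently?" below can be approximated with a simple residual-stream probe. The sketch below is a hedged stand-in, not either paper's actual metric, and assumes a Hugging Face Transformers-style model that exposes `output_hidden_states`.

```python
import torch

@torch.no_grad()
def layer_update_norms(model, input_ids):
    """Relative size of each layer's residual update,
    ||h_{l+1} - h_l|| / ||h_l||, averaged over all tokens."""
    out = model(input_ids, output_hidden_states=True)
    hs = out.hidden_states                     # tuple of n_layers + 1 states
    ratios = []
    for h_in, h_out in zip(hs[:-1], hs[1:]):
        upd = (h_out - h_in).norm(dim=-1)      # per-token update magnitude
        ratios.append((upd / h_in.norm(dim=-1).clamp_min(1e-6)).mean().item())
    return ratios   # small late-layer values suggest under-used depth
```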
- TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z)
- Do Language Models Use Their Depth Efficiently? [61.0037917291838]
We analyze the residual streams of the Llama 3.1, Qwen 3, and OLMo 2 model families. We find that layers in the second half contribute much less than those in the first half. For multi-hop tasks, we are unable to find evidence that models use their increased depth to compose subresults.
arXiv Detail & Related papers (2025-05-20T04:00:56Z)
- A Two-Phase Perspective on Deep Learning Dynamics [0.0]
We propose that learning in deep neural networks proceeds in two phases: a rapid curve-fitting phase followed by a slower compression or coarse-graining phase. We empirically show that the associated timescales align in two rather different settings. We argue that the second phase is not actively optimized by standard training algorithms and may be unnecessarily prolonged.
arXiv Detail & Related papers (2025-04-17T06:57:37Z)
- Loop Neural Networks for Parameter Sharing [1.1049608786515839]
We introduce a novel Loop Neural Network, which achieves better performance by using more computation time without increasing the model size.
Our approach revisits the input multiple times, refining the prediction by iteratively looping over a subset of the model with residual connections.
We demonstrate the effectiveness of this method through experiments comparing versions of GPT-2 with our loop models, showing improved performance on language modeling tasks while maintaining similar parameter counts. (A minimal sketch of this looping pattern follows this entry.)
arXiv Detail & Related papers (2024-09-21T17:07:42Z)
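A hedged reading of the mechanism in the entry above, weight-tied iteration that re-injects the input through a residual connection on every pass, might look like the following. `LoopBlock` and its internals are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LoopBlock(nn.Module):
    """Apply one set of weights n_loops times, revisiting the input x
    via a residual connection on every iteration."""
    def __init__(self, d, n_loops=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.LayerNorm(d), nn.Linear(d, d), nn.GELU(), nn.Linear(d, d)
        )
        self.n_loops = n_loops

    def forward(self, x):
        h = x
        for _ in range(self.n_loops):   # more loops = more compute,
            h = x + self.body(h)        # but the parameter count is fixed
        return h

block = LoopBlock(d=64, n_loops=4)
print(sum(p.numel() for p in block.parameters()))  # unchanged by n_loops
```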
- A Deep Dive into the Connections Between the Renormalization Group and Deep Learning in the Ising Model [0.0]
The renormalization group (RG) is an essential technique in statistical physics and quantum field theory.
We develop extensive renormalization techniques for the 1D and 2D Ising models to provide a baseline for comparison.
For the 2D Ising model, we successfully generated Ising model samples using the Wolff algorithm and performed the RG flow using a quasi-deterministic method. (A sketch of a Wolff cluster update follows this entry.)
arXiv Detail & Related papers (2023-08-21T22:50:54Z)
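For reference, a minimal sketch of the standard Wolff cluster update mentioned in the entry above, for a periodic 2D Ising model with coupling J=1; the lattice size, seeding, and sweep count are illustrative choices, not the paper's setup.

```python
import numpy as np

def wolff_step(spins, beta, rng):
    """One Wolff cluster update: grow a cluster of aligned spins by
    activating bonds with probability 1 - exp(-2*beta), then flip it."""
    L = spins.shape[0]
    p_add = 1.0 - np.exp(-2.0 * beta)
    seed = (rng.integers(L), rng.integers(L))
    s = spins[seed]
    cluster, stack = {seed}, [seed]
    while stack:
        i, j = stack.pop()
        for nb in (((i + 1) % L, j), ((i - 1) % L, j),
                   (i, (j + 1) % L), (i, (j - 1) % L)):
            if spins[nb] == s and nb not in cluster and rng.random() < p_add:
                cluster.add(nb)
                stack.append(nb)
    for site in cluster:                    # flip the whole cluster at once
        spins[site] = -s

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(32, 32))
beta_c = 0.5 * np.log(1.0 + np.sqrt(2.0))   # critical inverse temperature
for _ in range(200):
    wolff_step(spins, beta_c, rng)
print("magnetization:", spins.mean())
```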
- Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of Semantics and Depth [83.94528876742096]
We tackle the multi-task learning (MTL) problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called the Cross-Channel Attention Module (CCAM).
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth, called AffineMix, and a simple depth augmentation using predicted semantics, called ColorAug.
Finally, we validate the performance gains of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantic segmentation.
arXiv Detail & Related papers (2022-06-21T17:40:55Z)
- Exploiting Correspondences with All-pairs Correlations for Multi-view Depth Estimation [19.647670347925754]
Multi-view depth estimation plays a critical role in reconstructing and understanding the 3D world.
We design a novel iterative multi-view depth estimation framework mimicking the optimization process.
We conduct extensive experiments on ScanNet, DeMoN, ETH3D, and 7Scenes to demonstrate the superiority of our method.
arXiv Detail & Related papers (2022-05-05T07:38:31Z)
- Intermediate Layer Optimization for Inverse Problems using Deep Generative Models [86.29330440222199]
Intermediate Layer Optimization (ILO) is a novel optimization algorithm for solving inverse problems with deep generative models. (A toy sketch of the idea follows this entry.)
We empirically show that our approach outperforms state-of-the-art methods introduced in StyleGAN-2 and PULSE for a wide range of inverse problems.
arXiv Detail & Related papers (2021-02-15T06:52:22Z)
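The core idea of ILO, as described in the entry above, is to optimize an intermediate activation of a pretrained generator rather than only its input latent. Below is a toy, hypothetical sketch with an untrained stand-in generator (`g_front`, `g_back`) and a toy measurement operator `A`; the real method uses StyleGAN-2 and additionally constrains the optimized activation, which is omitted here.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained generator split into two halves.
g_front = nn.Sequential(nn.Linear(64, 256), nn.ReLU())  # latent -> intermediate
g_back = nn.Sequential(nn.Linear(256, 784))             # intermediate -> signal

def ilo_sketch(y, A, steps=200, lr=1e-2):
    """Minimal sketch: instead of optimizing only the input latent,
    directly optimize an intermediate activation h so that
    A(g_back(h)) matches the measurements y."""
    with torch.no_grad():
        h = g_front(torch.randn(1, 64))      # initialize from a random latent
    h = h.clone().requires_grad_(True)
    opt = torch.optim.Adam([h], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((A(g_back(h)) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return g_back(h).detach()                # reconstructed signal

A = lambda x: x[:, ::4]                      # toy measurement: subsample entries
y = A(torch.randn(1, 784))                   # fake observations for the demo
x_hat = ilo_sketch(y, A)
print(x_hat.shape)                           # torch.Size([1, 784])
```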
- Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model. (A simplified chain-MRF sketch follows this entry.)
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable, and provides parameter-efficient and robust solutions for stereo, optical flow, and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z)
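As a simplified illustration of the max-product message passing the entry above builds on, here is a plain chain-MRF version in the log domain. The paper's BP-Layer operates on image grids with learned, differentiable updates, so everything below is a reduced stand-in, not the paper's implementation.

```python
import numpy as np

def max_product_bp_chain(unary, pairwise):
    """Max-product belief propagation on a chain MRF (log domain).
    unary: (n_nodes, n_labels) scores; pairwise[a, b]: score for label a
    at node i and label b at node i+1. One forward/backward sweep is
    exact on a chain; on loopy grids the BP-Layer truncates to a few sweeps."""
    n, k = unary.shape
    fwd = np.zeros((n, k))                   # messages passed left -> right
    bwd = np.zeros((n, k))                   # messages passed right -> left
    for i in range(1, n):
        m = (unary[i - 1] + fwd[i - 1])[:, None] + pairwise
        fwd[i] = m.max(axis=0)               # maximize over the sender's label
    for i in range(n - 2, -1, -1):
        m = (unary[i + 1] + bwd[i + 1])[None, :] + pairwise
        bwd[i] = m.max(axis=1)
    beliefs = unary + fwd + bwd              # max-marginals per node and label
    return beliefs.argmax(axis=1)

unary = np.random.default_rng(0).normal(size=(6, 4))
pairwise = -0.5 * (1 - np.eye(4))            # Potts-style smoothness prior
print(max_product_bp_chain(unary, pairwise))
```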
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.