Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- URL: http://arxiv.org/abs/2507.10524v3
- Date: Sat, 25 Oct 2025 14:12:56 GMT
- Title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
- Abstract summary: We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking. We also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to further decrease memory footprint.
- Score: 61.67090981767583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to further decrease memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
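To make the mechanism concrete, here is a minimal PyTorch sketch: one shared block is applied recursively, a lightweight per-token router decides which tokens keep recursing, and attention is masked to the still-active tokens. Everything here (class and parameter names, the 0.5 exit threshold) is an illustrative assumption, not the authors' implementation, which additionally caches KV pairs only for active tokens.

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Minimal sketch of the Mixture-of-Recursions idea (illustrative only)."""

    def __init__(self, d_model: int = 64, max_recursions: int = 3):
        super().__init__()
        # Parameter efficiency: one shared block reused across recursion steps.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, nhead=4, batch_first=True)
        # Lightweight router: a scalar "keep recursing" logit per token.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); every token starts active.
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break
            # Attention only among still-active tokens; inactive ones are
            # masked out (the paper also caches KV just for active tokens).
            updated = self.shared_block(x, src_key_padding_mask=~active)
            # Tokens that already exited keep their last state unchanged.
            x = torch.where(active.unsqueeze(-1), updated, x)
            # Router decides, per token, whether to recurse again
            # (0.5 is an arbitrary illustrative threshold).
            active = active & (torch.sigmoid(self.router(x)).squeeze(-1) > 0.5)
        return x

out = MoRSketch()(torch.randn(2, 8, 64))  # (2, 8, 64)
```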
Related papers
- Looping Back to Move Forward: Recursive Transformers for Efficient and Flexible Large Multimodal Models [63.47909317137073]
Large Multimodal Models (LMMs) have achieved remarkable success in vision-language computation tasks. But their vast parameter counts are often underutilized during both training and inference. We propose RecursiveVLM, a recursive Transformer architecture tailored for LMMs.
arXiv Detail & Related papers (2026-02-09T17:58:23Z) - SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse [45.255254030425846]
We propose VersatileFFN, a novel feed-forward network that enables flexible reuse of parameters in both width and depth dimensions. A difficulty-aware gating dynamically balances the two pathways, steering "easy" tokens through the efficient width-wise route and allocating deeper iterative refinement to "hard" tokens. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method.
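A minimal sketch of one plausible reading of this blurb: a single FFN's weights serve both a one-pass "width" route and an iterated "depth" route, mixed per token by a sigmoid difficulty gate. The routing form and all names are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class VersatileFFNSketch(nn.Module):
    """Illustrative reading of width-and-depth parameter reuse (not the
    authors' code): one FFN serves both routes, and a gate mixes them."""

    def __init__(self, d_model: int = 64, depth_iters: int = 3):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        self.gate = nn.Linear(d_model, 1)  # difficulty-aware gate (assumption)
        self.depth_iters = depth_iters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        wide = self.ffn(x)                 # single pass: the "easy" route
        deep = x
        for _ in range(self.depth_iters):  # same weights reused in depth:
            deep = deep + self.ffn(deep)   # iterative refinement for "hard" tokens
        g = torch.sigmoid(self.gate(x))    # per-token difficulty score
        return (1 - g) * wide + g * deep

out = VersatileFFNSketch()(torch.randn(2, 8, 64))  # (2, 8, 64)
```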
arXiv Detail & Related papers (2025-12-16T16:08:23Z) - MeSH: Memory-as-State-Highways for Recursive Transformers [23.995570647573484]
Recursive models with fewer parameters often lag behind non-recursive counterparts under matched compute. By probing hidden states, we trace this performance gap to two primary bottlenecks. We introduce a Memory-as-State-Highways scheme, which externalizes state management into an explicit memory buffer.
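A speculative sketch of the buffer idea as described: each recursion step reads from and writes to explicit per-token memory slots instead of overloading one hidden state. The slot count, read/write maps, and names are my assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MeSHSketch(nn.Module):
    """Speculative memory-as-state-highways sketch: explicit per-token
    memory slots accompany the shared recursive block."""

    def __init__(self, d_model: int = 64, n_slots: int = 4, steps: int = 3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.read = nn.Linear(n_slots * d_model, d_model)
        self.write = nn.Linear(d_model, n_slots * d_model)
        self.n_slots, self.steps = n_slots, steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        mem = self.write(x).view(b, t, self.n_slots, d)     # initialize slots
        for _ in range(self.steps):
            h = x + self.read(mem.flatten(2))               # read the highway
            x = self.block(h)                               # shared recursive block
            mem = mem + self.write(x).view(b, t, self.n_slots, d)  # write back
        return x

out = MeSHSketch()(torch.randn(2, 8, 64))  # (2, 8, 64)
```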
arXiv Detail & Related papers (2025-10-09T03:23:38Z) - AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling [43.69519440553312]
The Autoregressive Block-Based Iterative Encoder (AbbIE) achieves better perplexity than a standard Transformer. AbbIE performs its iterations in latent space, but unlike latent reasoning models, does not require a specialized dataset or training protocol.
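A minimal sketch of latent-space iteration in the spirit of this description (the block, iteration count, and vocabulary size are illustrative assumptions): the same encoder block refines hidden states several times before any decoding, so the amount of iteration can be scaled at test time.

```python
import torch
import torch.nn as nn

d = 64
block = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
decode = nn.Linear(d, 1000)   # vocabulary projection; size is illustrative

h = torch.randn(2, 16, d)     # embedded input block
for _ in range(4):            # iteration count can be scaled at test time
    h = block(h)              # refinement stays in latent space
logits = decode(h)            # decode only after the iterations finish
```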
arXiv Detail & Related papers (2025-07-11T13:11:11Z) - Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed by unfolding an SR optimization function constrained by structural similarity. Our model outperforms current state-of-the-art models with lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z) - MOGNET: A Mux-residual quantized Network leveraging Online-Generated weights [2.7036595757881323]
MOGNET is a compact model architecture compatible with resource-limited hardware. It achieves higher accuracy, by a clear margin of up to 1%, at a similar or even smaller model size.
arXiv Detail & Related papers (2025-01-16T13:30:20Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to aggregate low-rank experts. Thanks to this design, ALoRE adds a negligible number of extra parameters and can be effortlessly merged into the frozen backbone.
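A small sketch of the aggregation idea under stated assumptions: several low-rank experts parameterized via Kronecker products are summed into one weight delta, which can then be merged into the frozen weight so inference adds no extra modules. Shapes, factor sizes, and names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

d, r, n_experts = 64, 4, 3
W_frozen = torch.randn(d, d)  # frozen backbone weight (stand-in)

# Each expert is a Kronecker product of two small factors, so a (d, d)
# delta costs only (d/r)^2 + r^2 parameters per expert.
A = [nn.Parameter(torch.randn(d // r, d // r) * 0.02) for _ in range(n_experts)]
B = [nn.Parameter(torch.randn(r, r) * 0.02) for _ in range(n_experts)]

delta = sum(torch.kron(a, b) for a, b in zip(A, B))  # aggregated experts, (d, d)
W_merged = W_frozen + delta   # merged once; inference adds no extra modules
```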
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers [36.51973134478652]
Mixture of Depths (MoD) dynamically adjusts computational depth by skipping less important layers. MoD approaches face two main challenges: (1) high training costs, since the entire model must be trained along with the routers that decide which layers to skip, and (2) the risk of performance degradation when important layers are bypassed. For the first challenge, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead of full model training. For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths.
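A minimal sketch of the router-only fine-tuning setup described above (the gated-residual skip form and all names are assumptions): the pretrained blocks are frozen and only per-layer routers receive optimizer updates.

```python
import torch
import torch.nn as nn

d = 64
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True) for _ in range(4))
routers = nn.ModuleList(nn.Linear(d, 1) for _ in blocks)

for p in blocks.parameters():   # challenge (1): no full-model training
    p.requires_grad_(False)
opt = torch.optim.AdamW(routers.parameters(), lr=1e-3)  # routers only

x = torch.randn(2, 8, d)
for block, router in zip(blocks, routers):
    gate = torch.sigmoid(router(x))          # soft per-token skip decision
    x = gate * block(x) + (1 - gate) * x     # "skip" keeps x unchanged
x.pow(2).mean().backward()                   # stand-in objective
opt.step()                                   # updates router weights only
```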
arXiv Detail & Related papers (2024-10-17T03:23:50Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, a minimal number of late pre-trained layers alleviates the peak memory overhead.
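A hedged sketch of the two routes as described, using a simple mean as a stand-in for the paper's anti-redundancy operation (that operation, the layer split, and all names are assumptions):

```python
import torch
import torch.nn as nn

d = 64
layers = [nn.TransformerEncoderLayer(d, nhead=4, batch_first=True) for _ in range(6)]
early, late = nn.ModuleList(layers[:-2]), nn.ModuleList(layers[-2:])
early.requires_grad_(False)            # early route stays frozen
fuse = nn.Linear(d, d)                 # stand-in for the anti-redundancy op

x = torch.randn(2, 8, d)
feats = []
with torch.no_grad():                  # early route stores no activations
    for layer in early:
        x = layer(x)
        feats.append(x)
x = fuse(torch.stack(feats).mean(dim=0))   # consolidate intermediate outputs
for layer in late:                     # late route: only these layers
    x = layer(x)                       # backpropagate, capping peak memory
x.sum().backward()
```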
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks.
We identify and characterise the important components needed for effective model convergence using gradient descent.
This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
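A minimal sketch of the rank-1 compression idea under stated assumptions (it omits the paper's sub-token splitting, and the fixed projection direction `u` is illustrative): the backward pass reconstructs activations from a rank-1 projection instead of storing them in full.

```python
import torch
import torch.nn.functional as F

class Rank1LinearFn(torch.autograd.Function):
    """Store a rank-1 projection of the input for backward instead of the
    full activation (the paper's sub-token splitting is omitted here)."""

    @staticmethod
    def forward(ctx, x, weight, u):
        ctx.save_for_backward(x @ u, weight, u)  # coefficients only, not x
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        coeff, weight, u = ctx.saved_tensors
        x_approx = coeff.unsqueeze(-1) * u       # rank-1 reconstruction of x
        grad_x = grad_out @ weight               # exact input gradient
        grad_w = grad_out.t() @ x_approx         # approximate weight gradient
        return grad_x, grad_w, None              # u is fixed: no gradient

x = torch.randn(32, 64, requires_grad=True)
w = torch.randn(16, 64, requires_grad=True)
u = F.normalize(torch.randn(64), dim=0)          # fixed projection direction
Rank1LinearFn.apply(x, w, u).sum().backward()
```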
arXiv Detail & Related papers (2024-05-28T09:23:14Z) - Less is KEN: a Universal and Simple Non-Parametric Pruning Algorithm for Large Language Models [1.5807079236265718]
KEN is a straightforward, universal, and unstructured pruning algorithm based on Kernel Density Estimation (KDE).
KEN aims to construct optimized transformers by selectively preserving the most significant parameters while restoring the others to their pre-training state.
KEN-pruned models achieve equal or better performance than their original unpruned versions, with a minimum parameter reduction of 25%.
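A hedged sketch of a KDE-based keep/reset step. The criterion used here, treating low-density (most distinctive) weights as the significant ones to preserve, is an assumption about the abstract's wording, not the paper's exact rule.

```python
import numpy as np
from scipy.stats import gaussian_kde

def ken_style_reset(finetuned, pretrained, keep_ratio=0.75):
    """Fit a KDE over the fine-tuned weights, keep the least typical
    (lowest-density) fraction, and restore the rest to pre-trained values."""
    flat = finetuned.ravel()
    density = gaussian_kde(flat)(flat)      # density at each weight's value
    k = int(keep_ratio * flat.size)
    keep = np.argsort(density)[:k]          # lowest density = most distinctive
    out = pretrained.ravel().copy()         # start from the pre-trained state
    out[keep] = flat[keep]                  # preserve the significant weights
    return out.reshape(finetuned.shape)

rng = np.random.default_rng(0)
pre, fin = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
pruned = ken_style_reset(fin, pre)          # here 25% of weights are restored
```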
arXiv Detail & Related papers (2024-02-05T16:11:43Z) - Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
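A minimal sketch of this setup, with torchvision's resnet18 as a stand-in backbone (the adapter shape and feature choice are assumptions): the backbone runs under no_grad, so no gradients or intermediate activations are kept for it.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18    # stand-in backbone (assumption)

backbone = resnet18(weights=None).eval()
for p in backbone.parameters():
    p.requires_grad_(False)

# The lightweight parallel network is the only trainable part; it consumes
# backbone outputs, so the backbone never stores activations for backward.
adapter = nn.Sequential(nn.Linear(1000, 128), nn.ReLU(), nn.Linear(128, 10))

x = torch.randn(4, 3, 224, 224)
with torch.no_grad():                       # backbone is pure inference
    feats = backbone(x)
adapter(feats).sum().backward()             # gradients reach only the adapter
```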
arXiv Detail & Related papers (2024-02-05T10:55:47Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
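A minimal sketch of content-based clustering of keys and values (the k-means routine, cluster count, and names are illustrative assumptions): queries attend over cluster centroids and aggregated values instead of the full token sequence.

```python
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, n_clusters=16, iters=5):
    """Cluster keys with a few k-means steps, aggregate keys/values per
    cluster, and let queries attend over the clusters only."""
    b, t, d = k.shape
    cent = k[:, torch.randperm(t)[:n_clusters]]            # init centroids
    for _ in range(iters):
        assign = torch.cdist(k, cent).argmin(-1)           # (b, t) hard labels
        onehot = F.one_hot(assign, n_clusters).float()     # (b, t, c)
        counts = onehot.sum(1).clamp(min=1).unsqueeze(-1)  # cluster sizes
        cent = onehot.transpose(1, 2) @ k / counts         # mean of member keys
    v_agg = onehot.transpose(1, 2) @ v / counts            # aggregated values
    attn = F.softmax(q @ cent.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_agg                                    # (b, t, d)

q = k = v = torch.randn(2, 128, 64)
out = clustered_attention(q, k, v)   # attends over 16 clusters, not 128 tokens
```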
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - SD-Conv: Towards the Parameter-Efficiency of Dynamic Convolution [16.56592303409295]
Dynamic convolution achieves better performance for efficient CNNs at the cost of negligible FLOPs increase.
We propose a new framework, Sparse Dynamic Convolution (SD-Conv), to naturally integrate these two paths.
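A hedged sketch of one way to combine dynamic convolution with sparsity (the static binary mask and all names are assumptions, not the authors' formulation): an input-dependent mixture over several expert kernels, each made sparse by a mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDConvSketch(nn.Module):
    """Illustrative sparse dynamic convolution: a gated mixture over
    sparse expert kernels, applied per sample via grouped conv2d."""

    def __init__(self, c_in=8, c_out=8, k=3, n_experts=4, sparsity=0.5):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, c_out, c_in, k, k) * 0.02)
        self.gate = nn.Linear(c_in, n_experts)
        # Static binary mask makes each expert kernel sparse (assumption).
        self.register_buffer("mask", (torch.rand(self.experts.shape) > sparsity).float())

    def forward(self, x):
        b, c, h, w = x.shape
        alpha = F.softmax(self.gate(x.mean(dim=(2, 3))), dim=-1)   # (b, experts)
        sparse = self.experts * self.mask                          # sparse experts
        kern = torch.einsum("be,eoihw->boihw", alpha, sparse)      # per-sample kernel
        # Grouped-conv trick: one conv2d call applies a different kernel per sample.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kern.reshape(-1, c, *kern.shape[-2:]),
                       padding=1, groups=b)
        return out.view(b, -1, h, w)

y = SDConvSketch()(torch.randn(2, 8, 16, 16))   # (2, 8, 16, 16)
```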
arXiv Detail & Related papers (2022-04-05T14:03:54Z)