Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- URL: http://arxiv.org/abs/2507.10524v2
- Date: Mon, 21 Jul 2025 07:45:14 GMT
- Title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
- Abstract summary: We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking. We also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint.
- Score: 50.001816497407475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
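The abstract describes per-token recursion-depth routing over a shared block; the following PyTorch sketch illustrates that idea under simplifying assumptions (the class name `MoRSketch` and the sigmoid-to-depth mapping are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Illustrative sketch of per-token recursion-depth routing (not the official MoR code).

    A single shared Transformer block is applied up to `max_recursions` times.
    A lightweight linear router scores each token once, and the score determines
    how many recursion steps that token receives; exited tokens are left untouched.
    """

    def __init__(self, d_model=256, n_heads=4, max_recursions=3):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.router = nn.Linear(d_model, 1)  # lightweight per-token router
        self.max_recursions = max_recursions

    def forward(self, x):  # x: (batch, seq, d_model)
        # Map router scores in (0, 1) to integer depths in {1, ..., max_recursions}.
        scores = torch.sigmoid(self.router(x)).squeeze(-1)            # (batch, seq)
        depths = torch.ceil(scores * self.max_recursions).clamp(min=1).long()

        for step in range(1, self.max_recursions + 1):
            active = depths >= step                                    # tokens still recursing
            if not active.any():
                break
            # For brevity the shared block runs on the full sequence and the update is
            # masked out for exited tokens; an efficient version would gather only the
            # active tokens and cache only their KV pairs, as the abstract describes.
            updated = self.shared_block(x)
            x = torch.where(active.unsqueeze(-1), updated, x)
        return x

model = MoRSketch()
print(model(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

The masked update keeps the sketch short; the efficiency gains reported in the paper come from actually restricting attention and KV caching to the tokens still active at each recursion depth.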
Related papers
- AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling [43.69519440553312]
The Autoregressive Block-Based Iterative Encoder (AbbIE) achieves better perplexity than a standard Transformer.
AbbIE performs its iterations in latent space, but unlike latent reasoning models, it does not require a specialized dataset or training protocol.
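As a rough illustration of iterating a shared encoder block in latent space (the block choice and iteration count are assumptions, not AbbIE's actual architecture):

```python
import torch
import torch.nn as nn

# Illustrative only: a shared encoder block iterated in latent space, so compute can
# be scaled by raising `num_iterations` without adding parameters.
block = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

def iterative_encode(x, num_iterations=4):
    latent = x
    for _ in range(num_iterations):   # each pass refines the previous latent state
        latent = block(latent)
    return latent

print(iterative_encode(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```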
arXiv Detail & Related papers (2025-07-11T13:11:11Z)
- Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR.
This method is designed through unfolding an SR optimization function constrained by structural similarity.
Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
- MOGNET: A Mux-residual quantized Network leveraging Online-Generated weights [2.7036595757881323]
MOGNET is a compact model architecture compatible with resource-limited hardware.
It can achieve higher accuracy, with a clear gap of up to 1%, at a similar or even lower model size.
arXiv Detail & Related papers (2025-01-16T13:30:20Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
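One way to picture the summarized mechanism, as a loose sketch only (the factor shapes and the way experts are aggregated are assumptions rather than ALoRE's actual parameterization):

```python
import torch
import torch.nn as nn

class KroneckerLowRankExperts(nn.Module):
    """Sketch: aggregate several low-rank 'experts' built with Kronecker products and
    add the result to a frozen weight. Shapes and aggregation are illustrative."""

    def __init__(self, frozen_weight, num_experts=4, rank=2, block=8):
        super().__init__()
        self.register_buffer("w_frozen", frozen_weight)  # (out, in), never trained
        out_dim, in_dim = frozen_weight.shape
        # Each expert contributes kron(S_e, A_e @ B_e): a small pattern times a low-rank matrix.
        self.s = nn.Parameter(torch.randn(num_experts, block, block) * 0.01)
        self.a = nn.Parameter(torch.randn(num_experts, out_dim // block, rank) * 0.01)
        self.b = nn.Parameter(torch.zeros(num_experts, rank, in_dim // block))  # zero initial update

    def merged_weight(self):
        delta = sum(torch.kron(self.s[e], self.a[e] @ self.b[e])
                    for e in range(self.s.shape[0]))
        return self.w_frozen + delta  # mergeable into the backbone: no extra inference cost

    def forward(self, x):
        return x @ self.merged_weight().t()

layer = KroneckerLowRankExperts(torch.randn(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```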
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers [36.51973134478652]
Mixture of Depths (MoD) dynamically adjusts the computational depth by skipping less important layers.
MoD approaches face two main challenges: (1) high training costs due to the need to train the entire model along with the routers that determine which layers to skip, and (2) the risk of performance degradation when important layers are bypassed.
To address the first challenge, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training.
For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths.
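A minimal sketch of the router-only fine-tuning idea described above, assuming a per-token gate that softly skips a frozen layer; only the router weights are left trainable (module names are illustrative):

```python
import torch
import torch.nn as nn

class RoutedLayer(nn.Module):
    """Wrap a frozen Transformer layer with a tiny trainable router that lets tokens
    (softly) skip the layer, in the spirit of Mixture-of-Depths. Sketch only."""

    def __init__(self, layer, d_model):
        super().__init__()
        self.layer = layer
        self.router = nn.Linear(d_model, 1)

    def forward(self, x):
        gate = torch.sigmoid(self.router(x))          # (batch, seq, 1): per-token keep probability
        return gate * self.layer(x) + (1 - gate) * x  # soft skip; a hard top-k gate at inference

d_model = 256
layers = nn.ModuleList(
    RoutedLayer(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), d_model)
    for _ in range(6)
)

# Router-Tuning idea: freeze everything except the routers, then fine-tune on a small dataset.
for name, p in layers.named_parameters():
    p.requires_grad = "router" in name

x = torch.randn(2, 16, d_model)
for layer in layers:
    x = layer(x)
```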
arXiv Detail & Related papers (2024-10-17T03:23:50Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, a minimal number of late pre-trained layers is used to alleviate the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks.
We identify and characterise the important components needed for effective model convergence using gradient descent.
This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z)
- Less is KEN: a Universal and Simple Non-Parametric Pruning Algorithm for Large Language Models [1.5807079236265718]
KEN is a straightforward, universal and unstructured pruning algorithm based on Kernel Density Estimation (KDE).
KEN aims to construct optimized transformers by selectively preserving the most significant parameters while restoring others to their pre-training state.
Models pruned with KEN achieve equal or better performance than their original unpruned versions, with a minimum parameter reduction of 25%.
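A rough sketch of the preserve-or-restore step described above; the specific rule of scoring values by Gaussian KDE density is an assumption about the details, so treat this as illustrative rather than KEN's exact procedure:

```python
import torch
from scipy.stats import gaussian_kde

def ken_style_reset(finetuned, pretrained, keep_ratio=0.75):
    """Keep the `keep_ratio` fraction of fine-tuned values deemed most significant
    (here: lowest KDE density, i.e. values that moved into unusual regions) and
    restore the rest to their pre-trained state. Illustrative, not KEN's exact rule."""
    flat = finetuned.flatten().numpy()
    density = torch.tensor(gaussian_kde(flat)(flat))           # density of each value
    k = int(keep_ratio * flat.size)
    keep_idx = torch.topk(density, k, largest=False).indices   # rarest values are kept
    mask = torch.zeros(flat.size, dtype=torch.bool)
    mask[keep_idx] = True
    return torch.where(mask.view(finetuned.shape), finetuned, pretrained)

w_finetuned, w_pretrained = torch.randn(64, 64), torch.randn(64, 64)
w_reset = ken_style_reset(w_finetuned, w_pretrained)
```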
arXiv Detail & Related papers (2024-02-05T16:11:43Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
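A compact sketch of that pattern: the pretrained backbone runs under `torch.no_grad()` and only a small parallel head on its features is trained, so no gradients or activations are kept for the backbone (a simplified, single-feature-level version, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

# Stand-in for a frozen, pretrained backbone (any feature extractor would do here).
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
backbone.requires_grad_(False).eval()

# Lightweight parallel head trained on the backbone's features only.
adapter = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

def training_step(images, labels):
    with torch.no_grad():              # no gradients or activations stored for the backbone
        feats = backbone(images)
    loss = nn.functional.cross_entropy(adapter(feats), labels)
    optimizer.zero_grad()
    loss.backward()                    # backpropagation touches only the adapter
    optimizer.step()
    return loss.item()

print(training_step(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))))
```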
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
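A toy sketch of this kind of content-based token reduction, assuming a few k-means steps over the keys (the clustering and aggregation details are illustrative, not ClusTR's exact method):

```python
import torch
import torch.nn.functional as F

def kmeans(x, num_clusters=16, iters=5):
    """A few k-means steps over token features x: (N, d). Illustrative only."""
    centroids = x[torch.randperm(x.shape[0])[:num_clusters]]
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=1)   # nearest centroid per token
        for c in range(num_clusters):
            members = x[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    return centroids, assign

def clustered_attention(q, k, v, num_clusters=16):
    """Queries attend over clustered keys/values instead of every token: O(N*C), not O(N^2)."""
    k_centroids, assign = kmeans(k, num_clusters)
    v_clustered = torch.stack([v[assign == c].mean(dim=0) if (assign == c).any()
                               else torch.zeros_like(v[0])
                               for c in range(num_clusters)])
    attn = F.softmax(q @ k_centroids.t() / q.shape[-1] ** 0.5, dim=-1)  # (N, C)
    return attn @ v_clustered                                            # (N, d)

q = k = v = torch.randn(256, 64)
print(clustered_attention(q, k, v).shape)  # torch.Size([256, 64])
```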
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- SD-Conv: Towards the Parameter-Efficiency of Dynamic Convolution [16.56592303409295]
Dynamic convolution achieves better performance for efficient CNNs at the cost of a negligible increase in FLOPs.
We propose a new framework, Sparse Dynamic Convolution (SD-Conv), to naturally integrate these two paths.
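One illustrative way to combine the two paths, not SD-Conv's actual formulation: a dynamic convolution whose input-conditioned kernel mixture is sparsified by magnitude pruning of the aggregated kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseDynamicConv(nn.Module):
    """Illustrative sketch: an input-conditioned mixture of K kernels (dynamic convolution)
    whose aggregated kernel is then sparsified by magnitude pruning."""

    def __init__(self, in_ch=16, out_ch=16, k=3, num_kernels=4, sparsity=0.5):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        self.attn = nn.Linear(in_ch, num_kernels)   # kernel attention from pooled input
        self.sparsity = sparsity
        self.pad = k // 2

    def forward(self, x):                            # x: (B, in_ch, H, W)
        weights = F.softmax(self.attn(x.mean(dim=(2, 3))), dim=-1)   # (B, K)
        outs = []
        for b in range(x.shape[0]):                  # per-sample loop for clarity
            w = torch.einsum("k,koihw->oihw", weights[b], self.kernels)
            thresh = torch.quantile(w.abs().flatten(), self.sparsity)
            w = w * (w.abs() >= thresh)              # sparse aggregated kernel
            outs.append(F.conv2d(x[b:b + 1], w, padding=self.pad))
        return torch.cat(outs)

layer = SparseDynamicConv()
print(layer(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```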
arXiv Detail & Related papers (2022-04-05T14:03:54Z)