Understanding Dynamic Compute Allocation in Recurrent Transformers
- URL: http://arxiv.org/abs/2602.08864v1
- Date: Mon, 09 Feb 2026 16:27:52 GMT
- Title: Understanding Dynamic Compute Allocation in Recurrent Transformers
- Authors: Ibraheem Muhammad Moosa, Suhas Lohit, Ye Wang, Moitreya Chatterjee, Wenpeng Yin,
- Abstract summary: Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. Prior work is primarily evaluated on natural-language benchmarks using task-level metrics. We introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.
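The per-token variable-depth computation described above can be illustrated with an ACT-style halting loop: each token repeatedly passes through a shared recurrent block until its cumulative halting probability crosses a threshold. This is a minimal sketch of the general mechanism, not the paper's ANIRA architecture; the toy `recurrent_step`, weights, and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def recurrent_step(h, W):
    # One shared recurrent block (a toy stand-in for a Transformer layer).
    return np.tanh(h @ W)

def halting_prob(h, w_halt):
    # Per-token sigmoid halting score.
    return 1.0 / (1.0 + np.exp(-(h @ w_halt)))

def adaptive_depth_forward(h, W, w_halt, max_steps=8, threshold=0.99):
    """ACT-style per-token halting: each token keeps iterating the shared
    block until its cumulative halting probability reaches the threshold.
    Returns the halting-weighted output and the depth each token used."""
    n_tokens, _ = h.shape
    cum_p = np.zeros(n_tokens)            # cumulative halting probability
    out = np.zeros_like(h)                # halting-weighted output state
    steps = np.zeros(n_tokens, dtype=int)
    active = np.ones(n_tokens, dtype=bool)
    for _ in range(max_steps):
        if not active.any():
            break
        h = np.where(active[:, None], recurrent_step(h, W), h)
        p = halting_prob(h, w_halt)
        # A token halts when its cumulative probability would cross the
        # threshold; the leftover probability mass weights its final state.
        will_halt = active & (cum_p + p >= threshold)
        weight = np.where(will_halt, 1.0 - cum_p,
                          np.where(active, p, 0.0))
        out += weight[:, None] * h
        cum_p += np.where(active, p, 0.0)
        steps += active
        active = active & ~will_halt
    return out, steps

d = 16
h0 = rng.standard_normal((5, d))
W = rng.standard_normal((d, d)) * 0.5
w_halt = rng.standard_normal(d)
out, steps = adaptive_depth_forward(h0, W, w_halt)
print(steps)  # number of recurrent steps each token took
```

Tokens whose states push the halting score up cross the threshold sooner, so depth varies per token without any explicit difficulty label, which is the property the abstract's evaluation paradigm is designed to probe.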
Related papers
- Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
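The verification step that this paper sparsifies can be sketched in a few lines: the target model scores all draft positions in one parallel pass, and the longest prefix where the draft agrees with the target is accepted, with the target's token appended at the first mismatch. This is a simplified greedy-acceptance sketch, not the paper's sparse-verification method or the stochastic acceptance rule used in full speculative decoding.

```python
def verify_draft(draft_tokens, target_tokens):
    """Accept the longest prefix where the draft matches the target model's
    greedy choices (assumed computed in one parallel forward pass), then
    append the target's correction token at the first disagreement."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # correction token from the target model
            break
    return accepted

print(verify_draft([3, 7, 2, 9], [3, 7, 5, 9]))  # -> [3, 7, 5]
```

When the draft model is accurate, most positions are accepted and the target model's cost is amortized over several tokens per forward pass, which is why making the verification pass itself sparse pays off.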
arXiv Detail & Related papers (2025-12-26T07:53:41Z)
- Scalable Bayesian Network Structure Learning Using Tsetlin Machine to Constrain the Search Space [10.753354249346073]
The PC algorithm is a widely used method in causal inference for learning the structure of Bayesian networks. Despite its popularity, the PC algorithm suffers from significant time complexity, particularly as the size of the dataset increases. We propose a novel approach that utilises the Tsetlin Machine (TM) to construct Bayesian structures more efficiently.
arXiv Detail & Related papers (2025-11-24T16:23:19Z)
- Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization [8.236500918322138]
We propose Complexity Out of Distribution (Complexity OoD) generalization as a framework to define and measure reasoning. A model exhibits Complexity OoD generalization when it maintains performance on test instances whose minimal required solution complexity exceeds that of all training examples. We translate this perspective into practice with recommendations for operationalizing Complexity OoD across the stack.
arXiv Detail & Related papers (2025-10-06T13:08:31Z)
- Unlocking Symbol-Level Precoding Efficiency Through Tensor Equivariant Neural Network [84.22115118596741]
We propose an end-to-end deep learning (DL) framework with low inference complexity for symbol-level precoding. We show that the proposed framework captures substantial performance gains of optimal SLP, while achieving an approximately 80-times speedup over conventional methods.
arXiv Detail & Related papers (2025-10-02T15:15:50Z)
- Computational Algebra with Attention: Transformer Oracles for Border Basis Algorithms [22.546453748805025]
We design and train a Transformer-based oracle that identifies and eliminates computationally expensive reduction steps. We achieve substantial speedup factors of up to 3.5x compared to the base algorithm. Our learning approach is data efficient, stable, and a practical enhancement to traditional computer algebra algorithms and symbolic computation.
arXiv Detail & Related papers (2025-05-29T17:35:25Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
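The core idea of rejection-based constrained generation can be sketched as follows: sample from the model's unconstrained distribution and check the constraint only on sampled tokens, removing rejected ones, rather than evaluating the constraint on the whole vocabulary up front. This is a generic illustrative sketch, not the paper's adaptive weighted rejection sampling algorithm; the toy distribution and even-token constraint are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def constrained_sample(probs, allowed, rng):
    """Rejection-style constrained sampling: draw from the unconstrained
    distribution and check the constraint only on sampled tokens.
    `allowed` is a callable token -> bool (the constraint checker)."""
    probs = probs.copy()
    for _ in range(len(probs)):
        total = probs.sum()
        if total == 0.0:
            break
        tok = rng.choice(len(probs), p=probs / total)
        if allowed(tok):
            return int(tok)
        probs[tok] = 0.0  # rejected: remove the token and renormalize
    raise ValueError("no token satisfies the constraint")

vocab = 10
probs = rng.random(vocab)
probs /= probs.sum()
tok = constrained_sample(probs, lambda t: t % 2 == 0, rng)
print(tok)
```

When most of the probability mass already satisfies the constraint, the expected number of constraint checks is small, which is the efficiency argument relative to scoring every vocabulary item.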
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Predicting Probabilities of Error to Combine Quantization and Early Exiting: QuEE [68.6018458996143]
We propose QuEE, a more general dynamic network that combines both quantization and early exiting.
Our algorithm can be seen as a form of soft early exiting or input-dependent compression.
The crucial factor of our approach is accurate prediction of the potential accuracy improvement achievable through further computation.
arXiv Detail & Related papers (2024-06-20T15:25:13Z)
- The Complexity of Optimizing Atomic Congestion [14.845310803203724]
Atomic congestion games are a classic topic in network design, routing, and algorithmic game theory.
We show that the problem remains highly intractable even on extremely simple networks.
We conclude by extending our analysis towards the (even more challenging) min-max variant of the problem.
arXiv Detail & Related papers (2023-12-15T21:31:30Z)
- Counting and Algorithmic Generalization with Transformers [0.0]
We show that standard Transformers are based on architectural decisions that hinder out-of-distribution performance.
We demonstrate that a modified transformer can exhibit a good algorithmic generalization performance on counting.
arXiv Detail & Related papers (2023-10-12T18:39:24Z)
- A Boosting Approach to Reinforcement Learning [59.46285581748018]
We study efficient algorithms for reinforcement learning in decision processes whose complexity is independent of the number of states.
We give an efficient algorithm that is capable of improving the accuracy of such weak learning methods.
arXiv Detail & Related papers (2021-08-22T16:00:45Z)
- Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity [58.70807593332932]
We study oracle complexity of gradient based methods for approximation problems.
We focus on instance-dependent complexity instead of worst case complexity.
Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation.
arXiv Detail & Related papers (2020-06-08T09:25:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.