MeSH: Memory-as-State-Highways for Recursive Transformers
- URL: http://arxiv.org/abs/2510.07739v1
- Date: Thu, 09 Oct 2025 03:23:38 GMT
- Title: MeSH: Memory-as-State-Highways for Recursive Transformers
- Authors: Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, Jiaang Li, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, Bo Zheng
- Abstract summary: Recursive models with fewer parameters often lag behind non-recursive counterparts under matched compute. By probing hidden states, we trace this performance gap to two primary bottlenecks. We introduce a Memory-as-State-Highways scheme, which externalizes state management into an explicit memory buffer.
- Score: 23.995570647573484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address these issues, we introduce a Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-1.4B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperform their larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.
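The buffer-plus-routers idea can be illustrated with a heavily simplified, hypothetical sketch in plain Python. The `core` function, the memory layout, and the fixed router weights below are all invented stand-ins; the paper's actual design uses a shared transformer block and learned routers over hidden states.

```python
# Toy sketch of the MeSH idea (simplified, not the paper's code): a shared
# "core" is iterated several times; instead of threading one hidden state
# through every iteration, states live in an explicit memory buffer, and
# per-iteration routers mix buffer slots for reading and writing.

def core(x):
    # stand-in for the shared transformer block: any nonlinear map
    return [v * 0.5 + 1.0 for v in x]

def mesh_forward(x, read_weights, write_weights, num_slots=3, iters=3):
    dim = len(x)
    memory = [list(x) for _ in range(num_slots)]  # buffer seeded from the input
    for t in range(iters):
        # read router: iteration-specific weighted sum over memory slots
        inp = [sum(read_weights[t][s] * memory[s][d] for s in range(num_slots))
               for d in range(dim)]
        out = core(inp)
        # write router: distribute the core's output back into the buffer
        for s in range(num_slots):
            w = write_weights[t][s]
            memory[s] = [(1 - w) * m + w * o for m, o in zip(memory[s], out)]
    return memory[0]  # read the final state from a designated slot
```

Because the read/write weights differ per iteration, each iteration can specialize (addressing undifferentiated computation), while long-lived information can persist in slots the write router leaves mostly untouched (addressing information overload).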
Related papers
- Memory Caching: RNNs with Growing Memory [56.25483647131372]
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states). We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications for both linear and deep memory modules. The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and perform better than state-of-the-art recurrent models.
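The checkpoint-caching idea can be sketched minimally as follows. The recurrent update and the uniform gate are invented placeholders; the paper proposes learned gated and sparse selective aggregation mechanisms.

```python
# Minimal, hypothetical sketch of Memory Caching: an RNN periodically caches
# checkpoints of its hidden state; the current state is augmented by a
# (here: mean-pooled, uniformly gated) aggregation over cached checkpoints.

def rnn_step(h, x):
    return 0.9 * h + 0.1 * x  # stand-in recurrent update

def run_with_memory_cache(inputs, cache_every=2):
    h, cache = 0.0, []
    for t, x in enumerate(inputs):
        h = rnn_step(h, x)
        if cache:
            # aggregate cached states into the current one (uniform gate)
            h = 0.5 * h + 0.5 * (sum(cache) / len(cache))
        if (t + 1) % cache_every == 0:
            cache.append(h)  # checkpoint the hidden state
    return h, cache
```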
arXiv Detail & Related papers (2026-02-27T18:53:41Z) - SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion [24.26069897783496]
SpiralFormer is a looped Transformer that executes recurrence under a multi-resolution recursion schedule. We show that SpiralFormer achieves better parameter and compute efficiency than both looped and non-looped baselines across model scales from 160M to 1.4B.
arXiv Detail & Related papers (2026-02-12T08:23:21Z) - TS-Memory: Plug-and-Play Memory for Time Series Foundation Models [63.21390142212087]
Time Series Foundation Models (TSFMs) achieve strong zero-shot forecasting through large-scale pre-training. Existing solutions face a trade-off: parametric adaptation can cause catastrophic forgetting, while non-parametric retrieval improves forecasts but incurs high latency due to datastore search. We propose Parametric Memory Distillation and implement it as TS-Memory, a lightweight memory adapter that augments frozen TSFMs.
arXiv Detail & Related papers (2026-02-12T04:16:19Z) - Looping Back to Move Forward: Recursive Transformers for Efficient and Flexible Large Multimodal Models [63.47909317137073]
Large Multimodal Models (LMMs) have achieved remarkable success in vision-language tasks, but their vast parameter counts are often underutilized during both training and inference. We propose RecursiveVLM, a recursive Transformer architecture tailored for LMMs.
arXiv Detail & Related papers (2026-02-09T17:58:23Z) - Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation [50.001816497407475]
We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking. We also propose a KV-sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint.
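Token-level adaptive recursion depth can be illustrated with a toy routing loop. The `shared_block` map, the threshold router, and the halting rule are all invented for illustration; MoR's routers are learned and operate on transformer hidden states.

```python
# Hypothetical sketch of MoR-style token-level routing: a shared block is
# applied up to max_depth times, and a router decides per token whether to
# keep recursing, so "easy" tokens exit early and "hard" tokens loop longer.

def shared_block(v):
    return v * 0.5  # stand-in for the shared layer stack

def router(v, depth):
    return abs(v) > 0.2  # recurse while the token still has "work" left

def mor_forward(tokens, max_depth=4):
    depths = [0] * len(tokens)
    out = list(tokens)
    for d in range(max_depth):
        for i, v in enumerate(out):
            if depths[i] == d and router(v, d):
                out[i] = shared_block(v)
                depths[i] = d + 1
    return out, depths
```

A KV-sharing variant would additionally reuse the key/value tensors computed at depth 0 for all later recursion steps, trading some fidelity for prefill latency and memory.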
arXiv Detail & Related papers (2025-07-14T17:49:00Z) - Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking [51.154226183713405]
We propose the Inner Thinking Transformer (ITT), which reimagines layer computations as implicit thinking steps. ITT achieves 96.5% of the performance of a 466M Transformer using only 162M parameters, reduces training data by 43.2%, and outperforms Transformer/Loop variants on 11 benchmarks.
arXiv Detail & Related papers (2025-02-19T16:02:23Z) - Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA [38.30350849992281]
"Recursive" language models share parameters across layers with minimal loss of performance. Recursive Transformers are efficiently initialized from standard pretrained Transformers, but use only a single block of unique layers that is then repeated multiple times in a loop. We show that our models outperform both similar-sized vanilla pretrained models and knowledge distillation baselines.
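The layer-wise LoRA relaxation can be sketched as a shared weight matrix plus a cheap per-iteration low-rank correction. The matrices, ranks, and loop count below are invented toy values; the paper applies this to full pretrained transformer blocks.

```python
# Illustrative sketch (not the paper's code) of layer-wise LoRA relaxation:
# every loop iteration applies the same shared weight matrix, plus its own
# low-rank delta A @ B, relaxing strict parameter tying at small extra cost.

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relaxed_recursive_forward(x, W_shared, loras, loops=2):
    h = x
    for i in range(loops):
        A, B = loras[i % len(loras)]        # per-iteration low-rank factors
        base = matvec(W_shared, h)          # shared computation
        low_rank = matvec(A, matvec(B, h))  # rank-r delta: A @ (B @ h)
        h = [b + l for b, l in zip(base, low_rank)]
    return h
```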
arXiv Detail & Related papers (2024-10-28T02:15:45Z) - Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays caused by both communication and throughput differences. PD-ASGD uses separate threads for the forward and backward passes, decoupling the updates and allowing a higher ratio of forward to backward threads. Our approach yields close to state-of-the-art results while running up to 5.95× faster than synchronous data parallelism in the presence of delays.
arXiv Detail & Related papers (2024-10-08T12:32:36Z) - Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: through a caching mechanism, the computation of a large proportion of layers in the diffusion transformer can be removed even without updating the model parameters.
We introduce a novel scheme, named Learningto-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
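The learned-caching mechanism can be illustrated with a toy per-layer mask. The `layer` function and the mask values are invented stand-ins; in L2C the recompute-vs-reuse decision is learned per layer rather than fixed by hand.

```python
# Hypothetical sketch of learning-to-cache: at each diffusion step, a
# per-layer mask decides which transformer layers recompute their output
# and which simply reuse the cached output from the previous step.

def layer(i, h):
    return h + (i + 1)  # stand-in for layer i's computation

def step_with_cache(h, cache, recompute_mask):
    for i, recompute in enumerate(recompute_mask):
        if recompute or cache[i] is None:
            cache[i] = layer(i, h)  # compute and refresh the cache
        h = cache[i]                # otherwise: reuse last step's output
    return h, cache
```

Because adjacent diffusion steps produce similar activations, layers whose mask is off cost nothing at this step while changing the output only slightly.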
arXiv Detail & Related papers (2024-06-03T18:49:57Z) - Attention Map Guided Transformer Pruning for Edge Device [98.42178656762114]
Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks.
We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads.
Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
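Attention-map-guided token pruning can be illustrated with a small scoring sketch. The column-mean scoring rule and the keep ratio are invented simplifications; the AMG method also prunes attention heads and uses statistics gathered during training.

```python
# Toy illustration of attention-map-guided token pruning: score each token
# by the average attention it receives from all tokens, then keep only the
# top-scoring fraction (preserving the original token order).

def prune_tokens(tokens, attn, keep_ratio=0.5):
    # attn[i][j]: attention token i pays to token j; score j by column mean
    n = len(tokens)
    scores = [sum(attn[i][j] for i in range(n)) / n for j in range(n)]
    k = max(1, int(n * keep_ratio))
    keep = sorted(sorted(range(n), key=lambda j: -scores[j])[:k])
    return [tokens[j] for j in keep]
```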
arXiv Detail & Related papers (2023-04-04T01:51:53Z) - Sliced Recursive Transformer [23.899076070924153]
Recursive operation on vision transformers can improve parameter utilization without involving additional parameters.
Our model Sliced Recursive Transformer (SReT) is compatible with a broad range of other designs for efficient vision transformers.
arXiv Detail & Related papers (2021-11-09T17:59:14Z) - ROME: Robustifying Memory-Efficient NAS via Topology Disentanglement and Gradient Accumulation [106.04777600352743]
Differentiable architecture search (DARTS) is largely hindered by its substantial memory cost since the entire supernet resides in the memory.
Single-path DARTS addresses this by choosing only a single-path submodel at each step, making it memory-friendly and computationally cheap. However, it is prone to instability and performance collapse.
We propose a new algorithm, RObustifying Memory-Efficient NAS (ROME), as a remedy.
arXiv Detail & Related papers (2020-11-23T06:34:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.