Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
- URL: http://arxiv.org/abs/2602.12204v1
- Date: Thu, 12 Feb 2026 17:40:15 GMT
- Title: Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
- Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
- Abstract summary: Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state. We introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory.
- Score: 6.908972852063454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits decreasing attention utilization over training, achieving a 37.8× reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is impossible without consolidation: any static routing scheme requires Ω(f · n) attention for tasks with recurring patterns of frequency f. On our proposed SRCD benchmark, CRAM achieves 100% retrieval accuracy at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with 48–52% attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology (γ = 0.43 vs. γ_human ≈ 0.4–0.5). Code and benchmarks are available at [anonymized].
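The consolidation idea in the abstract — skip the expensive episodic lookup once a cheap parametric memory can already predict its result — can be illustrated with a toy numpy sketch. This is not the paper's implementation: `attention_retrieve`, the linear semantic memory `W_sem`, and the threshold router are hypothetical stand-ins chosen so that attention utilization visibly decays as retrievals are distilled into parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden-state dimensionality (toy scale)

# Stand-in for full attention: a fixed retrieval map the model could query.
W_true = rng.normal(size=(D, D)) / np.sqrt(D)

def attention_retrieve(h):
    """What an episodic (attention) lookup would return for hidden state h."""
    return W_true @ h

# Parametric "semantic memory", distilled online from episodic retrievals.
W_sem = np.zeros((D, D))
tau, lr = 0.1, 0.05          # routing threshold and consolidation rate
attention_calls = []

for step in range(2000):
    h = rng.normal(size=D)
    pred = W_sem @ h                      # cheap parametric prediction
    target = attention_retrieve(h)        # ground-truth episodic retrieval
    err = np.linalg.norm(pred - target) / np.linalg.norm(target)
    if err > tau:                         # router falls back to attention...
        attention_calls.append(1)
        # ...and consolidates the retrieval into the parametric map (LMS update).
        W_sem += lr * np.outer(target - pred, h)
    else:                                 # parametric memory suffices; skip attention
        attention_calls.append(0)

early = np.mean(attention_calls[:100])
late = np.mean(attention_calls[-500:])
print(f"attention utilization: early={early:.2f}, late={late:.2f}")
```

Because every attention fall-back also trains `W_sem`, the routing rate is self-extinguishing: utilization starts near 1 and collapses once the parametric map tracks the retrieval function, loosely mirroring the decreasing-utilization behavior the abstract describes.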
Related papers
- Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning [25.852162778115808]
Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. We analyze and harness the model's tendency to restate the question, which we term the Echo of Prompt (EOP), as a front-loaded, compute-shaping mechanism.
arXiv Detail & Related papers (2026-02-06T10:53:26Z) - Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability [1.078600700827543]
We build a simple model-agnostic witness of training memory based on back-flow of distinguishability. We observe consistent positive back-flow with tight bootstrap confidence intervals, and amplification under higher momentum and more micro-steps. We position this as a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization.
arXiv Detail & Related papers (2026-01-23T09:03:25Z) - Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models [44.28116882776357]
We present Punctuation-aware Hybrid Sparse Attention (PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
arXiv Detail & Related papers (2026-01-06T08:47:16Z) - GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory [7.180426235884756]
GatedFWA is a memory-Gated Flash Windowed Attention mechanism. It stabilizes memory updates and makes gradient flow controllable. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead.
arXiv Detail & Related papers (2025-12-08T18:11:06Z) - Sparse Attention Post-Training for Mechanistic Interpretability [55.030850996535776]
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to ≈ 0.3% of its edges.
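The idea of sparsifying attention connectivity under a loss constraint can be sketched with a toy numpy example. This is not that paper's method: here magnitude pruning of a synthetic attention map stands in for the learned sparsity regulariser, and a relative output-error bound (`tol`) stands in for the constrained pretraining loss.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64  # sequence length (toy scale)

# Synthetic attention map: one strong "previous token" edge per row plus weak noise.
logits = 0.1 * rng.normal(size=(n, n))
logits[np.arange(n), (np.arange(n) - 1) % n] += 8.0
A = np.exp(logits)
A /= A.sum(1, keepdims=True)          # row-stochastic attention weights

V = rng.normal(size=(n, 8))           # value vectors
out_ref = A @ V                       # dense-attention output (the "loss" reference)

def prune(A, thresh):
    """Drop edges below thresh and renormalize the surviving ones per row."""
    M = np.where(A >= thresh, A, 0.0)
    return M / M.sum(1, keepdims=True)

# Sweep thresholds; keep the sparsest map whose output stays within tolerance.
tol = 0.05
best = None
for thresh in [1e-4, 1e-3, 1e-2, 1e-1]:
    M = prune(A, thresh)
    err = np.linalg.norm(M @ V - out_ref) / np.linalg.norm(out_ref)
    density = (M > 0).mean()          # fraction of attention edges kept
    if err < tol:
        best = (thresh, density)
print(best)
```

Because the synthetic map concentrates nearly all of its mass on one edge per row, pruning drives connectivity down to roughly 1/n of the edges while the output barely moves, which is the qualitative regime (a fraction of a percent of edges at preserved loss) the blurb reports.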
arXiv Detail & Related papers (2025-12-05T16:40:08Z) - QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification [67.15451442018258]
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. We propose QuantSparse, a unified framework that integrates model quantization with attention sparsification.
arXiv Detail & Related papers (2025-09-28T06:49:44Z) - Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning [57.514786046966265]
We propose Perturb-and-Merge (P&M), a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets.
arXiv Detail & Related papers (2025-05-28T14:14:19Z) - CODA: Repurposing Continuous VAEs for Discrete Tokenization [31.932323809073477]
CODA (COntinuous-to-Discrete Adaptation) is a framework that decouples compression and discretization. Our approach achieves a remarkable codebook utilization of 100% and a notable reconstruction FID (rFID) of 0.43 and 1.34 for 8× and 16× compression on the ImageNet 256×256 benchmark.
arXiv Detail & Related papers (2025-03-22T12:59:00Z) - Vision Transformer with Sparse Scan Prior [24.78780746169092]
We propose a Sparse Scan Self-Attention mechanism (S³A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors. Building on S³A, we introduce the Sparse Scan Vision Transformer.
arXiv Detail & Related papers (2024-05-22T04:34:36Z) - Attention Map Guided Transformer Pruning for Edge Device [98.42178656762114]
Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks.
We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads.
Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
arXiv Detail & Related papers (2023-04-04T01:51:53Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that σReparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation.
We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels.
STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.