Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
- URL: http://arxiv.org/abs/2602.12204v1
- Date: Thu, 12 Feb 2026 17:40:15 GMT
- Title: Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
- Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
- Abstract summary: Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state. We introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory.
- Score: 6.908972852063454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits decreasing attention utilization over training, achieving a 37.8× reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is impossible without consolidation: any static routing scheme requires Ω(f · n) attention for tasks with recurring patterns of frequency f. On our proposed SRCD benchmark, CRAM achieves 100% retrieval accuracy at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with 48–52% attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology (γ = 0.43 vs. γ_human ≈ 0.4–0.5). Code and benchmarks are available at [anonymized].
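The consolidation idea in the abstract — skip the expensive episodic lookup once a cheap parametric memory can already predict its result — can be illustrated with a toy numpy sketch. This is not the paper's implementation: `attention_retrieve`, the linear semantic memory `W_sem`, and the threshold router are hypothetical stand-ins chosen so that attention utilization visibly decays as retrievals are distilled into parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden-state dimensionality (toy scale)

# Stand-in for full attention: a fixed retrieval map the model could query.
W_true = rng.normal(size=(D, D)) / np.sqrt(D)

def attention_retrieve(h):
    """What an episodic (attention) lookup would return for hidden state h."""
    return W_true @ h

# Parametric "semantic memory", distilled online from episodic retrievals.
W_sem = np.zeros((D, D))
tau, lr = 0.1, 0.05          # routing threshold and consolidation rate
attention_calls = []

for step in range(2000):
    h = rng.normal(size=D)
    pred = W_sem @ h                      # cheap parametric prediction
    target = attention_retrieve(h)        # ground-truth episodic retrieval
    err = np.linalg.norm(pred - target) / np.linalg.norm(target)
    if err > tau:                         # router falls back to attention...
        attention_calls.append(1)
        # ...and consolidates the retrieval into the parametric map (LMS update).
        W_sem += lr * np.outer(target - pred, h)
    else:                                 # parametric memory suffices; skip attention
        attention_calls.append(0)

early = np.mean(attention_calls[:100])
late = np.mean(attention_calls[-500:])
print(f"attention utilization: early={early:.2f}, late={late:.2f}")
```

Because every attention fall-back also trains `W_sem`, the routing rate is self-extinguishing: utilization starts near 1 and collapses once the parametric map tracks the retrieval function, loosely mirroring the decreasing-utilization behavior the abstract describes.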
Related papers
- Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning [25.852162778115808]
Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. We analyze and harness the model's tendency to restate the question, which we term the Echo of Prompt (EOP), as a front-loaded, compute-shaping mechanism.
arXiv Detail & Related papers (2026-02-06T10:53:26Z) - Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability [1.078600700827543]
We build a simple model-agnostic witness of training memory based on back-flow of distinguishability. We observe consistent positive back-flow with tight bootstrap confidence intervals, and amplification under higher momentum and more micro-steps. We position this as a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization.
arXiv Detail & Related papers (2026-01-23T09:03:25Z) - Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models [44.28116882776357]
We present Punctuation-aware Hybrid Sparse Attention (PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
arXiv Detail & Related papers (2026-01-06T08:47:16Z) - GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory [7.180426235884756]
GatedFWA is a memory-Gated Flash Windowed Attention mechanism. It stabilizes memory updates and makes gradient flow controllable. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead.
arXiv Detail & Related papers (2025-12-08T18:11:06Z) - Sparse Attention Post-Training for Mechanistic Interpretability [55.030850996535776]
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to ≈ 0.3% of its edges.
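The idea of sparsifying attention connectivity under a loss constraint can be sketched with a toy numpy example. This is not that paper's method: here magnitude pruning of a synthetic attention map stands in for the learned sparsity regulariser, and a relative output-error bound (`tol`) stands in for the constrained pretraining loss.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64  # sequence length (toy scale)

# Synthetic attention map: one strong "previous token" edge per row plus weak noise.
logits = 0.1 * rng.normal(size=(n, n))
logits[np.arange(n), (np.arange(n) - 1) % n] += 8.0
A = np.exp(logits)
A /= A.sum(1, keepdims=True)          # row-stochastic attention weights

V = rng.normal(size=(n, 8))           # value vectors
out_ref = A @ V                       # dense-attention output (the "loss" reference)

def prune(A, thresh):
    """Drop edges below thresh and renormalize the surviving ones per row."""
    M = np.where(A >= thresh, A, 0.0)
    return M / M.sum(1, keepdims=True)

# Sweep thresholds; keep the sparsest map whose output stays within tolerance.
tol = 0.05
best = None
for thresh in [1e-4, 1e-3, 1e-2, 1e-1]:
    M = prune(A, thresh)
    err = np.linalg.norm(M @ V - out_ref) / np.linalg.norm(out_ref)
    density = (M > 0).mean()          # fraction of attention edges kept
    if err < tol:
        best = (thresh, density)
print(best)
```

Because the synthetic map concentrates nearly all of its mass on one edge per row, pruning drives connectivity down to roughly 1/n of the edges while the output barely moves, which is the qualitative regime (a fraction of a percent of edges at preserved loss) the blurb reports.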
arXiv Detail & Related papers (2025-12-05T16:40:08Z) - QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification [67.15451442018258]
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. We propose QuantSparse, a unified framework that integrates model quantization with attention sparsification.
arXiv Detail & Related papers (2025-09-28T06:49:44Z) - Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning [57.514786046966265]
We propose Perturb-and-Merge (P&M), a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets.
arXiv Detail & Related papers (2025-05-28T14:14:19Z) - CODA: Repurposing Continuous VAEs for Discrete Tokenization [31.932323809073477]
CODA (COntinuous-to-Discrete Adaptation) is a framework that decouples compression and discretization. Our approach achieves a remarkable codebook utilization of 100% and a notable reconstruction FID (rFID) of 0.43 and 1.34 for 8× and 16× compression on the ImageNet 256×256 benchmark.
arXiv Detail & Related papers (2025-03-22T12:59:00Z) - Vision Transformer with Sparse Scan Prior [24.78780746169092]
We propose a Sparse Scan Self-Attention mechanism (S³A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors. Building on S³A, we introduce the Sparse Scan Vision Transformer.
arXiv Detail & Related papers (2024-05-22T04:34:36Z) - Attention Map Guided Transformer Pruning for Edge Device [98.42178656762114]
Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks.
We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads.
Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
arXiv Detail & Related papers (2023-04-04T01:51:53Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that σReparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation.
We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels.
STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.