Gated Differentiable Working Memory for Long-Context Language Modeling
- URL: http://arxiv.org/abs/2601.12906v1
- Date: Mon, 19 Jan 2026 10:00:33 GMT
- Title: Gated Differentiable Working Memory for Long-Context Language Modeling
- Authors: Lingrui Mei, Shenghua Liu, Yiwei Wang, Yuyao Ge, Baolong Bi, Jiayu Yao, Jun Wan, Ziling Yin, Jiafeng Guo, Xueqi Cheng
- Abstract summary: We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4$\times$ fewer gradient steps than uniform baselines.
- Score: 80.27483324685434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long contexts challenge transformers: attention scores dilute across thousands of tokens, critical information is often lost in the middle, and models struggle to adapt to novel patterns at inference time. Recent work on test-time adaptation addresses this by maintaining a form of working memory -- transient parameters updated on the current context -- but existing approaches rely on uniform write policies that waste computation on low-utility regions and suffer from high gradient variance across semantically heterogeneous contexts. In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, focusing on which parts of the context should be consolidated into working memory under limited computation. We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. The controller estimates Contextual Utility, an information-theoretic measure of long-range contextual dependence, and allocates gradient steps accordingly while maintaining global coverage. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4$\times$ fewer gradient steps than uniform baselines, establishing a new efficiency-performance Pareto frontier for test-time adaptation.
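To make the consolidation loop described in the abstract concrete, the sketch below shows a budget-constrained write policy: gradient steps are allocated across context chunks in proportion to a utility score while a per-chunk floor preserves global coverage. This is a minimal illustration under stated assumptions, not Gdwm's implementation: the Contextual Utility scores are assumed to be given, the softmax-proportional allocation rule and the helper names (`allocate_steps`, `consolidate`) are made up for the example, and the model is assumed to expose a Hugging Face-style next-token loss.

```python
# Sketch (not the paper's code) of budget-constrained memory consolidation:
# split a fixed gradient-step budget across context chunks by utility,
# keeping at least one step per chunk for global coverage.
import torch

def allocate_steps(utilities: torch.Tensor, budget: int, floor: int = 1) -> torch.Tensor:
    """Split a total gradient-step budget across chunks by normalized utility."""
    n = utilities.numel()
    base = torch.full((n,), floor, dtype=torch.long)        # global coverage
    remaining = budget - floor * n
    if remaining <= 0:
        return base
    weights = torch.softmax(utilities, dim=0)               # assumed utility -> share
    extra = torch.floor(weights * remaining).long()
    leftover = remaining - int(extra.sum())                  # hand rounding leftovers
    if leftover > 0:                                         # to the highest-utility chunks
        extra[torch.topk(utilities, leftover).indices] += 1
    return base + extra

def consolidate(model, chunks, utilities, budget, lr=1e-4):
    """Write each chunk into transient (fast) weights with its allocated steps."""
    fast_params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(fast_params, lr=lr)
    for chunk, k in zip(chunks, allocate_steps(utilities, budget).tolist()):
        for _ in range(k):
            # Assumes a causal LM that returns .loss when given labels.
            loss = model(input_ids=chunk, labels=chunk).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In this reading, the "uniform baseline" corresponds to `allocate_steps` with equal utilities, and the 4$\times$ saving comes from spending most of the budget on high-utility chunks.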
Related papers
- TempoNet: Slack-Quantized Transformer-Guided Reinforcement Scheduler for Adaptive Deadline-Centric Real-Time Dispatchs [8.818252253980985]
TempoNet is a reinforcement learning scheduler that pairs a permutation-invariant Transformer with a deep Q-approximation. A latency-aware sparse attention stack with blockwise top-k selection and locality-sensitive chunking enables global reasoning over unordered task sets.
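As a rough illustration of the "blockwise top-k selection" mentioned above (a generic sketch, not TempoNet's code), each query scores key blocks by a pooled summary and attends only inside the selected blocks; the pooling choice and block size here are assumptions.

```python
# Generic blockwise top-k attention sketch: score key blocks by their mean key,
# keep the top-k blocks per query, and attend only within the selected blocks.
import torch
import torch.nn.functional as F

def blockwise_topk_attention(q, k, v, block_size=64, top_k=4):
    # q: (T, d); k, v: (S, d). Pad S to a multiple of block_size for simplicity.
    S, d = k.shape
    pad = (-S) % block_size
    k = F.pad(k, (0, 0, 0, pad))
    v = F.pad(v, (0, 0, 0, pad))
    n_blocks = k.shape[0] // block_size
    k_blocks = k.view(n_blocks, block_size, d)
    v_blocks = v.view(n_blocks, block_size, d)
    block_keys = k_blocks.mean(dim=1)                      # (n_blocks, d) block summaries
    sel = (q @ block_keys.T).topk(min(top_k, n_blocks), dim=-1).indices
    out = torch.zeros_like(q)
    for t in range(q.shape[0]):                            # attend only to selected blocks
        keys = k_blocks[sel[t]].reshape(-1, d)
        vals = v_blocks[sel[t]].reshape(-1, d)
        attn = F.softmax(q[t] @ keys.T / d ** 0.5, dim=-1)
        out[t] = attn @ vals
    return out
```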
arXiv Detail & Related papers (2026-02-20T09:56:23Z) - Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z) - Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models [16.03043781097689]
TaTa (Training-free Test-Time Adaptation with Brownian Distance Covariance) leverages Brownian Distance Covariance to dynamically adapt vision-language models to new domains without training or back-propagation. Experiments across diverse datasets show that TaTa significantly reduces computational cost while achieving state-of-the-art performance in domain and cross-dataset generalization.
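For reference, the statistic named in this entry can be estimated as below; this is the standard sample distance-covariance estimator (a sketch, not TaTa's adaptation procedure), assuming paired feature batches such as image and text embeddings.

```python
# Sample (Brownian) distance covariance between two paired feature sets.
import torch

def distance_covariance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x: (n, p), y: (n, q), rows are paired samples.
    a = torch.cdist(x, x)                                   # pairwise Euclidean distances
    b = torch.cdist(y, y)
    A = a - a.mean(0, keepdim=True) - a.mean(1, keepdim=True) + a.mean()
    B = b - b.mean(0, keepdim=True) - b.mean(1, keepdim=True) + b.mean()
    # Double-centered products; clamp guards against tiny negative values.
    return (A * B).mean().clamp(min=0).sqrt()
```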
arXiv Detail & Related papers (2026-01-30T18:21:45Z) - AMA: Adaptive Memory via Multi-Agent Collaboration [54.490349689939166]
We propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods.
arXiv Detail & Related papers (2026-01-28T08:09:49Z) - Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression [53.48692193399171]
Gated KalmaNet (GKA) is a layer that reduces the gap by accounting for the full past when predicting the next token. We solve an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. At long context lengths, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading-memory baselines.
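The constant-memory claim follows from keeping only sufficient statistics of the past. Below is a generic recursive ridge-regression sketch that illustrates this (it is not GKA's gated layer; the class name and interface are invented for the example).

```python
# Online ridge regression with constant memory: accumulate K^T K and K^T V,
# so each token costs O(d^2) regardless of sequence length.
import torch

class OnlineRidge:
    def __init__(self, d_in: int, d_out: int, lam: float = 1e-2):
        self.A = lam * torch.eye(d_in)      # ridge term plus running sum of k k^T
        self.B = torch.zeros(d_in, d_out)   # running sum of k v^T

    def update(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # k: (d_in,), v: (d_out,) for the current token.
        self.A += torch.outer(k, k)
        self.B += torch.outer(k, v)

    def predict(self, q: torch.Tensor) -> torch.Tensor:
        # Read out with the current least-squares solution W = A^{-1} B.
        W = torch.linalg.solve(self.A, self.B)
        return q @ W
```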
arXiv Detail & Related papers (2025-11-26T03:26:37Z) - PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching [51.98089287914147]
Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo.
arXiv Detail & Related papers (2025-10-23T03:52:39Z) - ResFormer: All-Time Reservoir Memory for Long Sequence Classification [4.298381633106637]
Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art performance, have inherent limitations due to quadratic time and memory complexity. We propose ResFormer, a novel neural network architecture designed to model varying context lengths efficiently through a cascaded methodology.
arXiv Detail & Related papers (2025-09-28T21:20:49Z) - On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention [53.22963042513293]
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. We first propose dual-state linear attention (DSLA), a novel design that maintains two hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. We introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering.
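A two-state linear-attention recurrence in the spirit of this description might look as follows; the decay constants and the simple fixed mixing of the two readouts are illustrative assumptions, not DSLA's exact parameterization.

```python
# Sketch of dual-state linear attention: a slowly decaying state preserves
# history, a fast-decaying state tracks recency, and their readouts are mixed.
import torch

def dual_state_linear_attention(q, k, v, slow_decay=0.99, fast_decay=0.5, mix=0.5):
    # q, k, v: (T, d); causal, O(T * d^2) time and O(d^2) memory.
    T, d = q.shape
    S_hist = torch.zeros(d, d)
    S_recent = torch.zeros(d, d)
    out = torch.empty(T, d)
    for t in range(T):
        kv = torch.outer(k[t], v[t])
        S_hist = slow_decay * S_hist + kv        # long-horizon state
        S_recent = fast_decay * S_recent + kv    # short-horizon state
        out[t] = mix * (q[t] @ S_hist) + (1 - mix) * (q[t] @ S_recent)
    return out
```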
arXiv Detail & Related papers (2025-06-11T01:25:06Z) - Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers [58.98923344096319]
REFORM is a novel inference framework that efficiently handles long contexts through a two-phase approach. It achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains.
arXiv Detail & Related papers (2025-06-01T23:49:14Z) - ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains [25.075869018443925]
ReservoirTTA is a novel plug-in framework designed for prolonged test-time adaptation. At its core, ReservoirTTA maintains a reservoir of domain-specialized models. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse.
arXiv Detail & Related papers (2025-05-20T15:39:20Z) - Contextual Memory Reweaving in Large Language Models Using Layered Latent State Reconstruction [0.0]
Token dependencies degrade as sequence length increases, leading to a decline in coherence and factual consistency. A structured approach is introduced to mitigate this issue through the reweaving of latent states captured at different processing layers. The proposed Contextual Memory Reweaving framework incorporates a Layered Latent State Reconstruction mechanism.
arXiv Detail & Related papers (2025-02-04T06:25:20Z) - On the Convergence of Continual Federated Learning Using Incrementally Aggregated Gradients [7.226144684379189]
The holy grail of machine learning is to enable Continual Federated Learning (CFL) to enhance the efficiency, privacy, and scalability of AI systems while learning from streaming data. We propose a novel replay-memory-based federated strategy consisting of edge-based gradient updates on memory and aggregated gradients on the current data. We empirically show that C-FLAG outperforms several state-of-the-art baselines in both task- and class-incremental settings with respect to metrics such as accuracy and forgetting.
arXiv Detail & Related papers (2024-11-12T17:36:20Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.