Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
- URL: http://arxiv.org/abs/2511.21016v1
- Date: Wed, 26 Nov 2025 03:26:37 GMT
- Title: Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
- Authors: Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto
- Abstract summary: Gated KalmaNet (GKA) is a layer that reduces the gap by accounting for the full past when predicting the next token. We solve an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. On long-context benchmarks, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading memory baselines.
- Score: 53.48692193399171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance on recall-oriented tasks. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize on modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention; and (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, GKA shows strong language understanding capabilities on short-context tasks, outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context benchmarks, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading memory baselines.
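The abstract describes the core mechanism only at a high level, so the following is a toy sketch of the two main ingredients under stated assumptions: an online ridge regression over decayed key/value statistics (a common test-time-regression formulation; the decayed sufficient statistics, scalar gate, and crude eigenvalue bounds below are illustrative assumptions, not the paper's implementation), solved with a textbook Chebyshev Iteration rather than a direct inverse.

```python
import numpy as np

def chebyshev_solve(A, b, lam_min, lam_max, iters=16):
    """Chebyshev iteration for SPD A x = b, given eigenvalue bounds.
    Unlike CG, it needs no inner products, which is one reason such
    solvers are attractive for low-precision, parallel hardware."""
    theta = (lam_max + lam_min) / 2.0   # center of the spectrum
    delta = (lam_max - lam_min) / 2.0   # half-width of the spectrum
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    d = b / theta                       # first search direction (x0 = 0)
    x = d.copy()
    for _ in range(iters - 1):
        r = b - A @ x                   # current residual
        rho_new = 1.0 / (2.0 * sigma1 - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r
        x = x + d
        rho = rho_new
    return x

def gated_ridge_attention(Q, K, V, gate, lam=1.0, iters=16):
    """Toy fading-memory layer: at step t, output q_t @ W_t where W_t
    solves a gate-decayed ridge regression of values on keys.
    Sufficient statistics A (d x d) and B (d x m) give constant memory
    and linear compute in the sequence length."""
    T, d = K.shape
    m = V.shape[1]
    A = np.zeros((d, d))
    B = np.zeros((d, m))
    out = np.zeros((T, m))
    for t in range(T):
        g = gate[t]                          # input-dependent forgetting in (0, 1]
        A = g * A + np.outer(K[t], K[t])     # decayed key covariance
        B = g * B + np.outer(K[t], V[t])     # decayed key-value correlation
        A_reg = A + lam * np.eye(d)
        # Crude spectrum bounds: lam <= eig(A_reg) <= lam + trace(A).
        # GKA's adaptive regularization instead *controls* this condition
        # number; tight bounds matter for fast Chebyshev convergence.
        W = chebyshev_solve(A_reg, B, lam, lam + np.trace(A), iters)
        out[t] = Q[t] @ W
    return out
```

The chunk-wise, hardware-aware version and the backward-pass kernels described in the abstract are well beyond this sketch; the point is only how a Kalman-style recursive ridge solve can run with O(d^2) state per step.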
Related papers
- Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking [7.437238821092346]
State-space language models such as Mamba and gated linear attention (GLA) offer efficient alternatives to transformers. We address these limitations by reframing sequence modelling through a probabilistic lens. We introduce the Kalman Linear Attention (KLA) layer, a neural sequence-modelling primitive that performs time-parallel probabilistic inference.
arXiv Detail & Related papers (2026-02-11T11:11:45Z) - Gated Differentiable Working Memory for Long-Context Language Modeling [80.27483324685434]
We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4x fewer gradient steps than uniform baselines.
arXiv Detail & Related papers (2026-01-19T10:00:33Z) - Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution [3.551701030393209]
Phase Gradient Flow (PGF) is a framework that computes exact analytical derivatives by operating directly in the state-space manifold. Our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.
arXiv Detail & Related papers (2025-12-28T20:27:58Z) - GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory [7.180426235884756]
GatedFWA is a memory-Gated Flash Windowed Attention mechanism. It stabilizes memory updates and makes gradient flow controllable. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead.
arXiv Detail & Related papers (2025-12-08T18:11:06Z) - Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs [17.499497967319332]
We introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our experiments on Gemma2 with the Needle-in-a-Haystack test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%.
arXiv Detail & Related papers (2025-10-28T16:34:18Z) - Logits Replay + MoClip: Stabilized, Low-Cost Post-Training with Minimal Forgetting [6.653834890554154]
We introduce Logits Replay + MoClip, a framework that compresses supervision in the logit space and stabilizes optimization at the update level. Empirically, our method improves domain performance on Communication Technology tasks while mitigating forgetting on general benchmarks.
arXiv Detail & Related papers (2025-10-10T08:55:32Z) - The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long sequence modeling tasks efficiently, offering both parallelizable training and fast inference. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
arXiv Detail & Related papers (2025-10-03T09:02:33Z) - On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention [53.22963042513293]
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. We first propose dual-state linear attention (DSLA), a novel design that maintains two hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. We introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering.
arXiv Detail & Related papers (2025-06-11T01:25:06Z) - Forget Forgetting: Continual Learning in a World of Abundant Memory [55.64184779530581]
Continual learning has traditionally focused on minimizing exemplar memory. This paper challenges this paradigm by investigating a more realistic regime. We find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones.
arXiv Detail & Related papers (2025-02-11T05:40:52Z) - Adaptive Probabilistic ODE Solvers Without Adaptive Memory Requirements [6.0735728088312175]
We develop an adaptive probabilistic solver with fixed memory demands. Switching to our method eliminates memory issues for long time series. We also accelerate simulations by orders of magnitude through unlocking just-in-time compilation.
arXiv Detail & Related papers (2024-10-14T14:10:47Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps mitigate the loss of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions [22.09320263962004]
We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace.
We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner.
We show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
arXiv Detail & Related papers (2023-02-07T21:50:06Z)
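Both the headline paper and the Kalman Linear Attention entry above build on the classical Kalman recursion. As a serial reference point (illustrative only, not either paper's layer, which reformulate or replace parts of this loop for parallel hardware), a minimal textbook filter for a linear-Gaussian state-space model:

```python
import numpy as np

def kalman_filter(ys, F, H, Q, R, mu0, P0):
    """Sequential Kalman filter: predict with dynamics F, correct with
    observation model H, observation noise R, process noise Q.
    Time-parallel layers like KLA recast this recursion as an
    associative scan; this loop is the plain serial baseline."""
    mu, P = mu0.copy(), P0.copy()
    n = len(mu0)
    means = []
    for y in ys:
        mu = F @ mu                       # predict mean
        P = F @ P @ F.T + Q               # predict covariance
        S = H @ P @ H.T + R               # innovation covariance
        G = P @ H.T @ np.linalg.inv(S)    # Kalman gain
        mu = mu + G @ (y - H @ mu)        # correct mean with observation
        P = (np.eye(n) - G @ H) @ P       # correct covariance
        means.append(mu.copy())
    return np.array(means)
```

With near-zero process noise this filter reduces to a running average of the observations, which makes its fading/accumulating-memory behavior easy to inspect.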
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.