GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
- URL: http://arxiv.org/abs/2512.07782v1
- Date: Mon, 08 Dec 2025 18:11:06 GMT
- Title: GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
- Authors: Jiaxu Liu, Yuhe Bai, Christos-Savvas Bouganis
- Abstract summary: GatedFWA is a Memory-Gated (Flash) Windowed Attention mechanism. It stabilizes memory updates and makes gradient flow controllable. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead.
- Score: 7.180426235884756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an Associative Memory interpretation, its difference-style update renders the training objective effectively unbounded. In contrast, Softmax attention normalizes updates, leading to memory shrinkage and gradient vanishing. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context, and it integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
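As a rough illustration of the idea in the abstract (a per-token gate accumulated into a decay bias that is added to the attention logits under a sliding window), the following single-head sketch may help. It is not the paper's fused kernel: the log-sigmoid gate parameterization, the prefix-sum form of the bias, and the names gated_swa, gate_logits, and window are all assumptions made for illustration.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)); always <= 0.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def gated_swa(q, k, v, gate_logits, window):
    """Single-head gated sliding window attention (illustrative only).

    q, k, v:      lists of T vectors (lists of floats)
    gate_logits:  T floats, one hypothetical gate logit per token
    window:       number of most recent keys each query may attend to
    """
    T, d = len(q), len(q[0])

    # Assumed gate preprocessing: prefix sum of log-gates,
    # c[t] = sum over s <= t of log(sigmoid(gate_logits[s])).
    c, running = [], 0.0
    for g in gate_logits:
        running += log_sigmoid(g)
        c.append(running)

    out = []
    for i in range(T):
        lo = max(0, i - window + 1)  # causal sliding window [lo, i]
        # Logit = scaled dot product + decay bias (c[i] - c[j]) <= 0,
        # so older keys are down-weighted by the gates in between.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) + (c[i] - c[j])
            for j in range(lo, i + 1)
        ]
        m = max(scores)  # subtract the max for a stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([
            sum(w[n] * v[lo + n][col] for n in range(len(w))) / z
            for col in range(len(v[0]))
        ])
    return out
```

In this sketch, large positive gate logits make the bias vanish, so the computation reduces to plain sliding window attention, while a strongly negative gate logit at a token suppresses the keys before it for that query and all later ones.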
Related papers
- Stateful Token Reduction for Long-Video Hybrid VLMs [69.6930118088911]
We study query-conditioned token reduction for hybrid video vision-language models (VLMs). We propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks. Under an aggressive compression setting, our approach delivers substantial prefilling speedups with near-baseline accuracy at test time.
arXiv Detail & Related papers (2026-02-27T08:11:06Z) - AllMem: A Memory-centric Recipe for Efficient Long-context Modeling [32.025154452526856]
Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks. We introduce AllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks.
arXiv Detail & Related papers (2026-02-14T09:04:28Z) - Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z) - Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression [53.48692193399171]
Gated KalmaNet (GKA) is a layer that reduces the gap by accounting for the full past when predicting the next token. We solve an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. On long-context, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading memory baselines.
arXiv Detail & Related papers (2025-11-26T03:26:37Z) - OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. The resulting discrete tokenization shortens the training sequence by 6.8× and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z) - REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization [130.46612643194973]
reAR is a simple training strategy introducing a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standardization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
arXiv Detail & Related papers (2025-10-06T02:48:13Z) - SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers [15.142822497807236]
We propose SCOUT, a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations. SCOUT retains much of the expressivity of full attention while substantially reducing the computational and memory cost. We analyze SCOUT's computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks.
arXiv Detail & Related papers (2025-08-31T17:08:33Z) - On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention [53.22963042513293]
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. We first propose dual-state linear attention (DSLA), a novel design that maintains two hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. We introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering.
arXiv Detail & Related papers (2025-06-11T01:25:06Z) - Efficient Pretraining Length Scaling [21.4715211093876]
We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens.
arXiv Detail & Related papers (2025-04-21T09:41:26Z) - Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs [6.222287867011644]
We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. Unlike retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. Our studies show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works.
arXiv Detail & Related papers (2025-03-02T18:12:50Z) - Simple linear attention language models balance the recall-throughput tradeoff [60.06020449520365]
We propose BASED, a simple architecture combining linear and sliding window attention. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers. We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.