LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling
- URL: http://arxiv.org/abs/2602.11562v1
- Date: Thu, 12 Feb 2026 04:33:37 GMT
- Title: LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling
- Authors: Tianhe Lin, Ziwei Xiong, Baoyuan Ou, Yingjie Qin, Lai Xu, Xiaocheng Zhong, Yao Hu, Zhiyong Wang, Tao Zhou, Yubin Xu, Di Wu
- Abstract summary: We present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead.
- Score: 20.507605423606282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
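The abstract describes STA only at a high level, so the following is a minimal sketch of one way the idea could look, not the authors' implementation. It assumes that the gate is a sigmoid of the scaled dot product between the target and each history item, that each segment is compressed into a gated sum, and that a single softmax target-attention layer stands in for the GSTA module. The function names (`segmented_target_attention`, `global_target_attention`) and the `segment_len` parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    # Numerically stable form: never exponentiates a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segmented_target_attention(
    target: Vector, items: List[Vector], segment_len: int
) -> List[Vector]:
    """Compress a long history into one summary vector per segment.

    Assumption: each item gets an independent sigmoid gate. Unlike softmax,
    the gates need not sum to 1, so a segment with no relevant items can be
    driven towards zero (the "silence" behaviour the abstract mentions).
    """
    scale = math.sqrt(len(target))
    summaries = []
    for start in range(0, len(items), segment_len):
        segment = items[start:start + segment_len]
        gates = [sigmoid(dot(target, item) / scale) for item in segment]
        summary = [
            sum(g * item[d] for g, item in zip(gates, segment))
            for d in range(len(target))
        ]
        summaries.append(summary)
    return summaries


def global_target_attention(target: Vector, summaries: List[Vector]) -> Vector:
    """One softmax target-attention layer over the segment summaries.

    Assumption: this is a stand-in for GSTA, which the abstract describes
    as stacked; stacking is omitted here for brevity.
    """
    scale = math.sqrt(len(target))
    weights = softmax([dot(target, s) / scale for s in summaries])
    return [
        sum(w * s[d] for w, s in zip(weights, summaries))
        for d in range(len(target))
    ]


if __name__ == "__main__":
    target = [1.0, 0.0]
    history = [[5.0, 0.0], [-5.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    summaries = segmented_target_attention(target, history, segment_len=2)
    print(len(summaries), global_target_attention(target, summaries))
```

In this sketch the target scores each of the N history items once, and the second stage attends over only N / `segment_len` summaries, which is where the compression comes from.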
Related papers
- AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting [59.31340724915079]
Event Spotting is a key task for applications in sports analytics, robotics, and autonomous systems. AdaSpot achieves state-of-the-art performance under strict evaluation metrics.
arXiv Detail & Related papers (2026-02-25T16:24:48Z) - GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder [54.64137490632567]
We propose a novel and unified framework designed to capture users' sequences from long-term history. Generative Multi-streamers (GEMs) break user sequences into three streams. Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperforms state-of-the-art methods in recommendation accuracy.
arXiv Detail & Related papers (2026-02-14T06:42:56Z) - Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention [28.598033369607723]
Light Forcing is the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk. We also introduce a Sparse Attention to capture informative historical and local context in a coarse-to-fine manner.
arXiv Detail & Related papers (2026-02-04T17:41:53Z) - Evolutionary Mapping of Neural Networks to Spatial Accelerators [64.13809409887254]
We introduce the first evolutionary, hardware-in-the-loop mapping framework for neuromorphic accelerators. We evaluate our approach on Intel Loihi 2, a representative spatial accelerator featuring 152 cores in a 2D mesh. Our method achieves up to 35% reduction in total latency compared to default cores on two sparse multi-layer perceptron networks.
arXiv Detail & Related papers (2026-02-04T16:28:08Z) - SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration [23.86429472943524]
We present a training-free acceleration framework that exploits three properties of Visual AutoRegressive attention: strong attention sinks, cross-scale activation similarity, and pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism. Our method achieves a $\mathbf{1.57\times}$ speed-up while preserving almost all high-frequency details.
arXiv Detail & Related papers (2026-02-04T09:34:06Z) - OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent [58.07447442040785]
We introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation. Results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales.
arXiv Detail & Related papers (2026-01-12T17:55:51Z) - SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision computation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
arXiv Detail & Related papers (2025-12-24T18:59:05Z) - LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders [23.70714095931094]
LONGER is a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER consistently outperforms strong baselines in offline metrics and online A/B testing.
arXiv Detail & Related papers (2025-05-07T13:54:26Z) - PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$.
arXiv Detail & Related papers (2025-03-05T15:24:11Z) - Climber: Toward Efficient Scaling Laws for Large Recommendation Models [8.970144942471425]
We propose Climber, an efficient recommendation framework comprising two synergistic components. Our proposed model adopts two core innovations: (1) multi-scale sequence extraction that achieves a time complexity reduction by a constant factor, enabling more efficient scaling with sequence length; (2) dynamic temperature modulation adapting attention distributions to the multi-scenario and multi-behavior patterns. Climber has been successfully deployed on Netease Cloud Music, one of China's largest music streaming platforms, serving tens of millions of users daily.
arXiv Detail & Related papers (2025-02-14T03:25:09Z) - ELASTIC: Efficient Linear Attention for Sequential Interest Compression [5.689306819772134]
State-of-the-art sequential recommendation models heavily rely on transformer's attention mechanism. We propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression. We conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders.
arXiv Detail & Related papers (2024-08-18T06:41:46Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation [15.35494431928751]
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving. We introduce model-attention disaggregation to enhance the efficiency of LLM decoding. We develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster.
arXiv Detail & Related papers (2024-05-03T02:15:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.