Related papers: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

URL: http://arxiv.org/abs/2502.11089v1
Date: Sun, 16 Feb 2025 11:53:44 GMT
Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng,
Abstract summary: We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision.
Score: 32.48360534726024
License:
Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

Related papers

Efficient Implementation of LinearUCB through Algorithmic Improvements and Vector Computing Acceleration for Embedded Learning Systems [0.10470286407954035]
This paper presents algorithmic and hardware techniques to implement two LinearUCB Contextual Bandits algorithms on resource-constrained embedded devices. Results show notable improvements in execution time and energy consumption.
arXiv Detail & Related papers (2025-01-22T13:39:44Z)
Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens. When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens. Compared to previously proposed inference-time acceleration method which attends only to local context or tokens with high accumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner. Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks. It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping. It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100,3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
Bioinspired Cortex-based Fast Codebook Generation [0.09449650062296822]
We introduce a feature extraction method inspired by sensory cortical networks in the brain. Dubbed as bioinspired cortex, the algorithm provides convergence to features from streaming signals with superior computational efficiency. We show herein the superior performance of the cortex model in clustering and vector quantization.
arXiv Detail & Related papers (2022-01-28T18:37:43Z)
Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks. specially trained CNNs that employ parametrised early exits along their depth to save during inference on easier samples. We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
HAPI: Hardware-Aware Progressive Inference [18.214367595727037]
Convolutional neural networks (CNNs) have recently become the state-of-the-art in a diversity of AI tasks. Despite their popularity, CNN inference still comes at a high computational cost. This work presents HAPI, a novel methodology for generating high-performance early-exit networks.
arXiv Detail & Related papers (2020-08-10T09:55:18Z)
Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs [13.628734116014819]
Deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNN) There is a lack of research on cross-level optimisation as the space of approaches becomes too large to test and obtain a globally optimised solution. We present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms achieving up to 4x improvement in performance and over 2x reduction in memory.
arXiv Detail & Related papers (2020-06-09T11:00:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.