Related papers: The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

URL: http://arxiv.org/abs/2504.17768v1
Date: Thu, 24 Apr 2025 17:39:25 GMT
Title: The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti,
Abstract summary: We show that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones.<n>There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios.<n>We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments.
Score: 40.35884943268004
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks-including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) an isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.

Related papers

R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning [19.27175827358111]
Continual learning in large language models (LLMs) is prone to catastrophic forgetting, where adapting to new tasks significantly degrades performance on previously learned ones. We propose a novel continual full fine-tuning approach leveraging adaptive singular value decomposition (SVD) We evaluate our approach extensively on standard continual learning benchmarks using both encoder-decoder (T5-Large) and decoder-only (LLaMA-2 7B) models.
arXiv Detail & Related papers (2025-04-09T17:59:42Z)
Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.<n>LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.<n>We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning [22.950914612765494]
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks.<n>Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph.<n>We propose the Adaptive Zeroth-order-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods.
arXiv Detail & Related papers (2024-06-26T04:33:13Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Low-Rank Few-Shot Adaptation of Vision-Language Models [13.803180972839213]
We introduce Low-Rank Adaptation (LoRA) in few-shot learning for Vision-Language Models (VLMs) Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing the training times. Our results do not dismiss the potential of prompt-learning and adapter-based research.
arXiv Detail & Related papers (2024-05-28T19:16:59Z)
An Empirical Study of Automated Vulnerability Localization with Large Language Models [21.84971967029474]
Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in vulnerability localization remains underexplored. Our investigation encompasses 10+ leading LLMs suitable for code analysis, including ChatGPT and various open-source models. We explore the efficacy of these LLMs using 4 distinct paradigms: zero-shot learning, one-shot learning, discriminative fine-tuning, and generative fine-tuning.
arXiv Detail & Related papers (2024-03-30T08:42:10Z)
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models [42.95555008229016]
We propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need of any retraining. The advantages of the proposed method exhibit even more when the sparsity is extremely high.
arXiv Detail & Related papers (2023-10-14T05:43:09Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter [113.35761858962522]
This paper studies induced sparse patterns across multiple large pre-trained vision and language transformers. We propose the existence of essential sparsity defined with a sharp dropping point beyond which the performance declines much faster. We also find essential sparsity to hold valid for N:M sparsity patterns as well as on modern-scale large language models.
arXiv Detail & Related papers (2023-06-06T15:49:09Z)
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
Evolving Metric Learning for Incremental and Decremental Features [45.696514400861275]
We develop a new online Evolving Metric Learning model for incremental and decremental features. Our model can handle the instance and feature evolutions simultaneously by incorporating with a smoothed Wasserstein metric distance. In addition to tackling the challenges in one-shot case, we also extend our model into multishot scenario.
arXiv Detail & Related papers (2020-06-27T10:29:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.