ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models
- URL: http://arxiv.org/abs/2402.13516v4
- Date: Wed, 3 Jul 2024 05:56:49 GMT
- Title: ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models
- Authors: Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun
- Abstract summary: Activation sparsity refers to the existence of weakly-contributed elements among activation outputs.
This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity.
- Score: 74.59731375779934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function, activation sparsity has been proven a promising paradigm to boost model inference efficiency. Nevertheless, most large language models (LLMs) adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to help LLMs achieve activation sparsity and inference acceleration, but few can simultaneously obtain high sparsity and comparable model performance. This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity while maintaining comparable performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor smoothly increasing along the multi-stage sine curves. This can enhance activation sparsity and mitigate performance degradation by avoiding radical shifts in activation distributions. With ProSparse, we obtain high sparsity of 89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size MiniCPM-1B, respectively, achieving comparable performance to their original Swish-activated versions. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference acceleration experiments further demonstrate the significant practical acceleration potential of LLMs with higher activation sparsity, obtaining up to 4.52$\times$ inference speedup.
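The abstract names two core ingredients: substituting the activation function with ReLU, and a progressive sparsity regularization whose factor rises smoothly along multi-stage sine curves. The PyTorch sketch below illustrates those two ideas under explicit assumptions; the module layout, the `sine_increasing_factor` schedule, and the use of a mean absolute-value (L1-style) penalty on the intermediate FFN activations are illustrative choices, not ProSparse's exact implementation.

```python
import math
import torch
import torch.nn as nn

def sine_increasing_factor(step, stage_start, stage_end, lambda_start, lambda_end):
    """Illustrative schedule: raise the regularization factor smoothly from
    lambda_start to lambda_end within one stage along a sine curve, echoing the
    abstract's "factor smoothly increasing along the multi-stage sine curves".
    The stage boundaries and factor values here are assumptions."""
    if step <= stage_start:
        return lambda_start
    if step >= stage_end:
        return lambda_end
    progress = (step - stage_start) / (stage_end - stage_start)
    return lambda_start + (lambda_end - lambda_start) * math.sin(0.5 * math.pi * progress)

class ReLUGatedFFN(nn.Module):
    """Gated FFN with the activation swapped to ReLU (the substitution step)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        act = torch.relu(self.gate(x)) * self.up(x)  # ReLU zeros out many entries
        self.last_activation = act                   # cached for the sparsity penalty
        return self.down(act)

def sparsity_penalty(model: nn.Module, lam: float):
    """Assumed L1-style penalty on the cached intermediate activations,
    added to the language-modeling loss and scaled by the scheduled factor."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, ReLUGatedFFN) and hasattr(module, "last_activation"):
            penalty = penalty + module.last_activation.abs().mean()
    return lam * penalty
```

In a training loop, `lam = sine_increasing_factor(step, ...)` would be recomputed each step and the total loss formed as `lm_loss + sparsity_penalty(model, lam)`; keeping the increase gradual is what the abstract credits with avoiding radical shifts in the activation distribution.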
Related papers
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated.
We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
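The summary only names the metric, so the sketch below shows a generic performance-aware sparsity measurement rather than the paper's exact PPL-$p\%$ procedure: sweep a magnitude threshold for zeroing small activations and report the highest sparsity whose perplexity stays within p% of the dense baseline. The hooks `set_activation_threshold`, `perplexity`, and `measured_sparsity` are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def performance_aware_sparsity(model, eval_batches, thresholds, p: float = 1.0) -> float:
    """Generic illustration, not the paper's exact PPL-p% definition: zero out
    activations whose magnitude falls below a threshold, and keep the highest
    measured sparsity whose perplexity rises by no more than p percent over
    the dense baseline. All model hooks used here are hypothetical."""
    baseline_ppl = model.perplexity(eval_batches)       # dense reference run
    best_sparsity = 0.0
    for t in sorted(thresholds):
        model.set_activation_threshold(t)               # zero activations with |a| < t
        ppl = model.perplexity(eval_batches)
        if ppl <= baseline_ppl * (1.0 + p / 100.0):
            best_sparsity = max(best_sparsity, model.measured_sparsity())
    model.set_activation_threshold(0.0)                 # restore the dense behavior
    return best_sparsity
```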
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
- Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features [115.33889811527533]
Diffusion models were initially designed for image generation.
Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks.
arXiv Detail & Related papers (2024-10-04T16:05:14Z)
- Q-Sparse: All Large Language Models can be Fully Sparsely-Activated [93.45300714803429]
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs).
Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference.
We also introduce Block Q-Sparse for batch training and inference.
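The summary above says Q-Sparse fully sparsifies activations; one standard way to impose an exact per-token activation budget, shown here only as a hedged illustration and not as the paper's precise formulation, is top-K magnitude masking with a straight-through gradient.

```python
import torch

def topk_activation_sparsify(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude activations per token and zero the
    rest. Illustrative of enforcing a fixed activation sparsity level; the
    actual Q-Sparse recipe should be taken from the paper itself."""
    kth_value = x.abs().topk(k, dim=-1).values[..., -1:]   # k-th largest |x| per row
    mask = (x.abs() >= kth_value).to(x.dtype)
    y = x * mask
    # Straight-through estimator: the forward pass uses the masked activations,
    # the backward pass lets gradients flow as if no masking had been applied.
    return x + (y - x).detach()
```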
arXiv Detail & Related papers (2024-07-15T17:59:29Z)
- ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models [67.97667465509504]
We develop a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns.
ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework.
arXiv Detail & Related papers (2024-06-24T13:41:08Z)
- Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters [20.093224415258174]
Activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity.
We propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio.
On mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second.
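The dReLU function is only named in the summary above; the sketch below shows one plausible reading of how a doubly ReLU-gated FFN could look, with ReLU applied on both branches so that a hidden neuron is inactive whenever either branch is non-positive. This is an assumption for illustration, not the paper's verified architecture.

```python
import torch
import torch.nn as nn

class DReLUStyleFFN(nn.Module):
    """Illustrative gated FFN in the spirit of a high-sparsity dReLU design:
    ReLU on both the gate and the up branch makes many hidden entries exactly
    zero. An assumption-labelled sketch, not the paper's exact module."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = torch.relu(self.gate(x)) * torch.relu(self.up(x))
        return self.down(hidden)
```

For contrast, a SwiGLU FFN computes `silu(gate(x)) * up(x)`, whose outputs are rarely exactly zero, which is why the summary notes that SwiGLU and GeGLU exhibit limited sparsity.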
arXiv Detail & Related papers (2024-06-10T01:21:59Z)
- Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study [20.404448253054014]
We investigate the efficacy of dynamic activation mechanisms within the LLaMA family of language models.
Our empirical findings have uncovered several inherent pitfalls in the current dynamic activation schemes.
arXiv Detail & Related papers (2024-05-15T11:42:42Z)
- Learn To be Efficient: Build Structured Sparsity in Large Language Models [17.940183066850565]
Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads.
Existing methods focus only on utilizing this naturally formed activation sparsity in a post-training setting.
We introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs.
arXiv Detail & Related papers (2024-02-09T01:18:16Z)
- ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs [91.31204876440765]
We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold.
To find the most efficient activation function for sparse computation, we propose a systematic framework.
We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$.
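The first bullet of this entry defines neuron activation through output magnitudes and a tailored threshold; a minimal sketch of that notion, together with the ReLU$^2$ (squared ReLU) function itself, is given below. The default threshold is an arbitrary placeholder, not the paper's tailored value.

```python
import torch

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    """ReLU^2 (squared ReLU): zero for non-positive inputs, x**2 otherwise."""
    return torch.relu(x) ** 2

def inactive_neuron_ratio(hidden: torch.Tensor, threshold: float = 1e-3) -> float:
    """Magnitude-based sparsity in the spirit of the summary above: a neuron
    counts as inactive when its output magnitude falls below a threshold.
    The 1e-3 default is a placeholder, not the paper's tailored choice."""
    return (hidden.abs() < threshold).float().mean().item()
```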
arXiv Detail & Related papers (2024-02-06T08:45:51Z)