Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
- URL: http://arxiv.org/abs/2411.02335v1
- Date: Mon, 04 Nov 2024 17:59:04 GMT
- Title: Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
- Authors: Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
- Abstract summary: Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated.
We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
- Score: 62.09617609556697
- License:
- Abstract: Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
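As a rough illustration of the quantity the abstract tracks, the sketch below computes an activation ratio (i.e., 1 minus the sparsity ratio) for ReLU and SiLU outputs. It is not the PPL-$p\%$ metric itself: the abstract does not say how that metric chooses its thresholds, so the fixed magnitude `threshold`, the helper name `activation_ratio`, and the toy `pre_activations` values are all assumptions made for this example.

```python
import math


def relu(x: float) -> float:
    """ReLU: negative inputs become exactly zero."""
    return max(0.0, x)


def silu(x: float) -> float:
    """SiLU (x * sigmoid(x)): negative inputs give small non-zero outputs."""
    return x / (1.0 + math.exp(-x))


def activation_ratio(values, threshold=0.0):
    """Fraction of outputs whose magnitude exceeds `threshold`.

    Hypothetical helper: a fixed magnitude threshold stands in for
    whatever performance-aware rule the paper's metric actually uses.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one activation value")
    active = sum(1 for v in values if abs(v) > threshold)
    return active / len(values)


# Toy pre-activation values (made up for illustration only).
pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.5, -1.0, 2.2, -0.1]
relu_out = [relu(x) for x in pre_activations]
silu_out = [silu(x) for x in pre_activations]

relu_ratio = activation_ratio(relu_out)
silu_ratio = activation_ratio(silu_out)
silu_ratio_thresholded = activation_ratio(silu_out, threshold=0.1)

print("ReLU activation ratio:", relu_ratio)
print("SiLU activation ratio (exact zeros only):", silu_ratio)
print("SiLU activation ratio (|v| > 0.1):", silu_ratio_thresholded)
print("ReLU sparsity ratio:", 1.0 - relu_ratio)
```

On these toy values, SiLU produces almost no exact zeros, so any sparsity measurement for it depends on a threshold. This is one reason a metric that applies to any activation function, as the abstract describes, is useful.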
Related papers
- Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features [115.33889811527533]
Diffusion models were initially designed for image generation.
Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks.
arXiv Detail & Related papers (2024-10-04T16:05:14Z)
- CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification [7.8430836312711465]
Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements.
Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference.
This paper introduces CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification.
arXiv Detail & Related papers (2024-09-02T16:41:44Z)
- ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models [67.97667465509504]
We develop a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns.
ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework.
arXiv Detail & Related papers (2024-06-24T13:41:08Z)
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [74.59731375779934]
Activation sparsity refers to the existence of weakly-contributed elements among activation outputs.
This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity.
arXiv Detail & Related papers (2024-02-21T03:58:49Z)
- Learn To be Efficient: Build Structured Sparsity in Large Language Models [17.940183066850565]
Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads.
Existing methods only focus on utilizing this naturally formed activation sparsity in a post-training setting.
We introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs.
arXiv Detail & Related papers (2024-02-09T01:18:16Z)
- ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs [91.31204876440765]
We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold.
To find the most efficient activation function for sparse computation, we propose a systematic framework.
We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$.
arXiv Detail & Related papers (2024-02-06T08:45:51Z)
- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models [35.77063662562747]
Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications.
Their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices.
We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer.
arXiv Detail & Related papers (2023-10-06T20:01:33Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We also examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.