CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
- URL: http://arxiv.org/abs/2409.01366v1
- Date: Mon, 2 Sep 2024 16:41:44 GMT
- Title: CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
- Authors: Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li
- Abstract summary: Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements.
Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference.
This paper introduces CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying large language models (LLMs) on edge devices presents significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. Existing methods typically apply thresholding-based sparsification driven by the statistics of the activation tensors, but they do not explicitly model the impact of sparsification on model quality, leading to greater performance degradation than necessary. To address this issue, this paper reformulates activation sparsification with a new objective that optimizes the sparsification decisions. Building on this reformulation, we propose CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification. First, channel-wise thresholding assigns a unique threshold to each activation channel in the feed-forward network (FFN) layers. Second, selective sparsification applies thresholding-based activation sparsification only to specific layers within the attention modules. Finally, we detail the implementation of sparse kernels that accelerate LLM inference. Experimental results show that CHESS incurs lower performance degradation across eight downstream tasks while activating fewer parameters than existing methods, speeding up LLM inference by up to 1.27x.
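As a concrete illustration of the channel-wise thresholding step, the sketch below assigns one magnitude threshold per FFN activation channel and zeroes entries that fall below it. The calibration routine, the keep_ratio parameter, and the quantile-based threshold estimate are illustrative assumptions for exposition, not the authors' implementation; selective sparsification would apply the same masking only to chosen inputs inside the attention modules.

```python
# Minimal sketch of channel-wise activation thresholding (illustrative only).
# Shapes, the keep_ratio parameter, and the quantile-based calibration are
# assumptions for exposition, not the implementation described in the paper.
import torch


def calibrate_channel_thresholds(activations: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """Estimate one threshold per channel from offline calibration activations.

    activations: [num_tokens, hidden_dim] FFN activations collected on a calibration set.
    keep_ratio:  fraction of entries to keep active in each channel (assumed hyperparameter).
    """
    # Per-channel quantile of absolute magnitude: entries below it will be pruned.
    return torch.quantile(activations.abs(), 1.0 - keep_ratio, dim=0)


def channel_wise_sparsify(x: torch.Tensor, thresholds: torch.Tensor) -> torch.Tensor:
    """Zero out activations whose magnitude falls below their channel's threshold."""
    mask = x.abs() >= thresholds  # thresholds broadcast over the token dimension
    return x * mask


# Toy usage: calibrate on random stand-in activations, then sparsify a new batch.
calib = torch.randn(4096, 1024)           # [tokens, channels] calibration activations
thresholds = calibrate_channel_thresholds(calib, keep_ratio=0.7)
x = torch.randn(8, 1024)                  # activations for a new batch of tokens
x_sparse = channel_wise_sparsify(x, thresholds)
print(f"active fraction: {x_sparse.ne(0).float().mean():.3f}")
```

The intuition is that a per-channel threshold adapts to each channel's magnitude distribution, whereas a single tensor-wide threshold, as in prior thresholding-based methods, prunes all channels uniformly.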
Related papers
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated.
We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is a more efficient activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z) - Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures [21.18741772731095]
Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values.
Existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance.
This paper proposes a low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs.
arXiv Detail & Related papers (2024-10-10T08:10:53Z) - PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead [24.611413814466978]
Large language models (LLMs) enhanced with retrieval-augmented generation (RAG) have introduced a new paradigm for web search.
Existing methods to enhance context awareness are often inefficient, incurring time or memory overhead during inference.
We propose Position-Embedding-Agnostic attention Re-weighting (PEAR) which enhances the context awareness of LLMs with zero inference overhead.
arXiv Detail & Related papers (2024-09-29T15:40:54Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, only a minimal number of late pre-trained layers is used, reducing peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [74.59731375779934]
Activation sparsity refers to the existence of weakly-contributed elements among activation outputs.
This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity.
arXiv Detail & Related papers (2024-02-21T03:58:49Z) - Learn To be Efficient: Build Structured Sparsity in Large Language Models [17.940183066850565]
Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads.
Existing methods focus only on exploiting naturally formed activation sparsity in a post-training setting.
We introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs.
arXiv Detail & Related papers (2024-02-09T01:18:16Z) - ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models [35.77063662562747]
Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications.
Their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices.
We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer.
arXiv Detail & Related papers (2023-10-06T20:01:33Z) - Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion [4.716845031095804]
Transformer models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost.
We show that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model.
We also introduce a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis.
arXiv Detail & Related papers (2023-10-06T16:34:51Z) - Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints [81.46143788046892]
We focus on the task of controlling the level of sparsity when performing sparse learning.
Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor.
We propose a constrained formulation where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion.
arXiv Detail & Related papers (2022-08-08T21:24:20Z) - Learning Bayesian Sparse Networks with Full Experience Replay for Continual Learning [54.7584721943286]
Continual Learning (CL) methods aim to enable machine learning models to learn new tasks without catastrophic forgetting of those that have been previously mastered.
Existing CL approaches often keep a buffer of previously-seen samples, perform knowledge distillation, or use regularization techniques towards this goal.
We propose to activate and select only a sparse set of neurons for learning current and past tasks at any stage.
arXiv Detail & Related papers (2022-02-21T13:25:03Z)