Related papers: Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

URL: http://arxiv.org/abs/2406.05955v2
Date: Tue, 11 Jun 2024 02:15:47 GMT
Title: Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
Authors: Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen,
Abstract summary: Activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. We propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio. On mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second.
Score: 20.093224415258174
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at \url{https://huggingface.co/PowerInfer}

Related papers

R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
Activation Sparsity Opportunities for Compressing General Large Language Models [4.5624217435826]
This work systematically investigates the tradeoff between enforcing activation sparsity and perplexity (accuracy) on state-of-the-art AI models. Our empirical analysis demonstrates that we can obtain around 50% of main memory and computing reductions for critical FFN components with negligible accuracy degradation.
arXiv Detail & Related papers (2024-12-13T02:26:54Z)
Mixture of Hidden-Dimensions Transformer [50.40325486463241]
We study hidden dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions. We propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost.
arXiv Detail & Related papers (2024-12-07T13:15:22Z)
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking [12.664307714758843]
We introduce a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. We describe a novel cache-aware masking strategy, which considers the cache state and activation magnitude to further increase cache hit rate. DIP achieves a 46% reduction in memory and 40% increase in throughput with $$ 0.1 loss in perplexity when compared to streaming the dense model from Flash.
arXiv Detail & Related papers (2024-12-02T11:07:51Z)
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated. We propose PPL-$p%$ sparsity, a precise and performance-aware activation sparsity metric. We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. We propose Rotated Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Smooth and Rotation operation. The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z)
CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification [7.8430836312711465]
Large language models (LLMs) on edge devices present significant challenges due to the substantial computational overhead and memory requirements. Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference. This paper introduces CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification.
arXiv Detail & Related papers (2024-09-02T16:41:44Z)
ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models [67.97667465509504]
We develop a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns. ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework.
arXiv Detail & Related papers (2024-06-24T13:41:08Z)
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs. We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights [2.8461446020965435]
We introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing Latent Diffusion Models. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG)
arXiv Detail & Related papers (2024-04-18T06:35:37Z)
ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [74.59731375779934]
Activation sparsity refers to the existence of weakly-contributed elements among activation outputs. This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity.
arXiv Detail & Related papers (2024-02-21T03:58:49Z)
Learn To be Efficient: Build Structured Sparsity in Large Language Models [17.940183066850565]
Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads. Existing methods only focus on utilizing this naturally formed activation sparsity in a post-training setting. We introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs.
arXiv Detail & Related papers (2024-02-09T01:18:16Z)
ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs [91.31204876440765]
We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold. To find the most efficient activation function for sparse computation, we propose a systematic framework. We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$2$.
arXiv Detail & Related papers (2024-02-06T08:45:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.