Training-Free Activation Sparsity in Large Language Models
- URL: http://arxiv.org/abs/2408.14690v2
- Date: Fri, 11 Oct 2024 20:02:15 GMT
- Title: Training-Free Activation Sparsity in Large Language Models
- Authors: James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun
- Abstract summary: Activation sparsity can enable practical inference speedups in large language models.
Existing methods face limitations that inhibit widespread adoption.
This paper describes TEAL, a training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model.
- Score: 32.37595108771431
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53$\times$ and 1.8$\times$ at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.
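The core mechanism is simple enough to sketch. Below is a minimal, hedged illustration of magnitude-based activation sparsity in PyTorch: entries of a hidden state whose magnitude falls below a threshold calibrated to a target sparsity level are zeroed before the matrix multiplication, so a sparse kernel can skip the corresponding weight columns. The per-tensor quantile calibration shown here is an assumption for illustration; TEAL's actual threshold calibration and kernel integration are described in the paper.

```python
import torch

def magnitude_sparsify(hidden: torch.Tensor, target_sparsity: float = 0.5) -> torch.Tensor:
    """Zero the smallest-magnitude entries of `hidden` so that roughly
    `target_sparsity` of them become exactly zero (illustration only)."""
    threshold = torch.quantile(hidden.abs().float().flatten(), target_sparsity)
    return hidden * (hidden.abs() >= threshold)

# Toy usage: sparsify the input to a projection during single-token decoding.
h = torch.randn(1, 4096)               # one token's hidden state
W = torch.randn(11008, 4096)           # e.g. an MLP up-projection
h_sparse = magnitude_sparsify(h, 0.5)  # ~50% of entries become exactly zero
y = h_sparse @ W.T                     # a sparse kernel could skip the zeroed columns
print((h_sparse == 0).float().mean())  # ≈ 0.5
```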
Related papers
- Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) from a theoretical perspective.
The key issue identified is the cumulative effect of reconstruction errors throughout the sparsification process.
We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z) - CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation [17.807249890437767]
We introduce CoLA and its memory-efficient implementation, CoLA-M.
We leverage the low-rank structure observed widely in model activations to reduce model size, boost model capacity, and improve training efficiency.
Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $\mathbf{2\times}$ and improves training throughput by $\mathbf{1.86\times}$ while maintaining full-rank-level performance.
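As a rough illustration of why low-rank structure cuts compute (a generic sketch, not CoLA's actual architecture or its memory-efficient CoLA-M variant), the snippet below replaces a dense d-by-d projection with a rank-r bottleneck, reducing the per-token cost from d^2 to 2dr multiply-adds; the rank value is an arbitrary assumption.

```python
import torch
import torch.nn as nn

d, r = 4096, 512  # hidden size and assumed bottleneck rank (illustrative only)

dense = nn.Linear(d, d, bias=False)      # d*d ≈ 16.8M mult-adds per token
low_rank = nn.Sequential(                # 2*d*r ≈ 4.2M mult-adds per token
    nn.Linear(d, r, bias=False),
    nn.Linear(r, d, bias=False),
)

x = torch.randn(1, d)
print(dense(x).shape, low_rank(x).shape)  # both (1, 4096)
```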
arXiv Detail & Related papers (2025-02-16T01:05:16Z) - Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training [3.195234044113248]
We propose NeuroAL, a top-up algorithm for network pruning.
It modifies the block-wise and row-wise sparsity, exploiting information from both the dense model and its sparse version.
It consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off.
arXiv Detail & Related papers (2024-11-11T15:30:16Z) - Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
Adaptive Sparse Trainer (AST) is a novel and efficient retraining framework tailored for semi-structured sparse models.
AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively.
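For context, 2:4 semi-structured sparsity keeps the two largest-magnitude weights in every contiguous group of four, the pattern accelerated by sparse tensor cores. The sketch below only builds such a mask; it illustrates the target pattern, not AST's retraining procedure.

```python
import torch

def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """Return a 2:4 mask: in each group of 4 consecutive weights along the
    last dim, keep the 2 largest-magnitude entries (illustration only)."""
    groups = weight.abs().reshape(-1, 4)
    topk = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, topk, torch.ones_like(topk, dtype=torch.bool))
    return mask.reshape(weight.shape)

W = torch.randn(8, 16)
M = two_four_mask(W)
print(M.float().mean())  # exactly 0.5 sparsity, in a hardware-friendly layout
print((W * M)[0])
```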
arXiv Detail & Related papers (2024-07-30T06:33:44Z) - Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters [20.093224415258174]
Activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity.
We propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio.
On mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second.
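A minimal sketch of the idea, under the assumption that dReLU amounts to applying ReLU to both the gate and up branches of a LLaMA-style gated MLP (so far more FFN activations are exactly zero than with SwiGLU); the training data mixture and recipe from the paper are not reflected here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """LLaMA-style gated MLP; `use_drelu=True` swaps in ReLU on both branches
    (a sketch of the dReLU idea, under the assumptions noted above)."""
    def __init__(self, d: int, d_ff: int, use_drelu: bool = False):
        super().__init__()
        self.gate = nn.Linear(d, d_ff, bias=False)
        self.up = nn.Linear(d, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d, bias=False)
        self.use_drelu = use_drelu

    def forward(self, x):
        if self.use_drelu:
            h = F.relu(self.gate(x)) * F.relu(self.up(x))  # many exact zeros
        else:
            h = F.silu(self.gate(x)) * self.up(x)          # SwiGLU: few exact zeros
        return self.down(h)

x = torch.randn(1, 4096)
mlp = GatedMLP(4096, 11008, use_drelu=True)
with torch.no_grad():
    h = F.relu(mlp.gate(x)) * F.relu(mlp.up(x))
print((h == 0).float().mean())  # typically well above 0.5 even at random init
```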
arXiv Detail & Related papers (2024-06-10T01:21:59Z) - Dual sparse training framework: inducing activation map sparsity via Transformed $\ell_1$ regularization [2.631955426232593]
This paper presents a method to induce sparsity in activation maps based on Transformed $\ell_1$ regularization.
Compared to previous methods, Transformed $\ell_1$ can achieve higher sparsity and better adapt to different network structures.
The dual sparse training framework can greatly reduce the computational load and provide potential for reducing the required storage during runtime.
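The transformed $\ell_1$ penalty has a standard closed form, $\mathrm{TL1}_a(x) = (a+1)|x| / (a + |x|)$, which behaves like $\ell_1$ near zero and saturates for large $|x|$. Below is a hedged sketch of using it as an activation-map regularizer; the training loop and penalty weight are assumptions, not the paper's exact framework.

```python
import torch

def transformed_l1(x: torch.Tensor, a: float = 1.0) -> torch.Tensor:
    """Transformed l1 penalty: sum of (a+1)|x| / (a + |x|) over all entries.
    Near zero it behaves like l1; for large |x| it saturates toward a+1,
    so small activations are pushed toward exactly zero."""
    absx = x.abs()
    return ((a + 1.0) * absx / (a + absx)).sum()

# Toy usage: add the penalty on an activation map to a placeholder task loss.
activations = torch.randn(8, 256, requires_grad=True)
task_loss = activations.pow(2).mean()
loss = task_loss + 1e-4 * transformed_l1(activations, a=1.0)
loss.backward()
```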
arXiv Detail & Related papers (2024-05-30T03:11:21Z) - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
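A simplified, hedged sketch of the mechanism: periodically take an SVD of a weight's gradient, keep the top-r left singular vectors as a projector, maintain optimizer state only in the rank-r subspace, and project the update back to full size. The optimizer (plain momentum here), projection side, and refresh interval are simplifications, not GaLore's exact recipe.

```python
import torch

def galore_style_step(weight, grad, state, lr=1e-3, rank=4, beta=0.9, refresh=200):
    """One simplified low-rank-projected momentum step (illustration only)."""
    # Periodically refresh the projector from the gradient's top singular vectors.
    if state.get("P") is None or state["step"] % refresh == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                       # (m, r) projector
        state["m"] = torch.zeros(rank, grad.shape[1])  # momentum lives in rank-r space
    P = state["P"]
    g_low = P.T @ grad                                 # project gradient to (r, n)
    state["m"] = beta * state["m"] + g_low             # optimizer state is rank-r only
    weight -= lr * (P @ state["m"])                    # project the update back
    state["step"] += 1

W = torch.randn(64, 32)
state = {"step": 0}
for _ in range(3):
    G = torch.randn_like(W)  # stand-in for a real gradient
    galore_style_step(W, G, state)
```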
arXiv Detail & Related papers (2024-03-06T07:29:57Z) - ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [74.59731375779934]
Activation sparsity refers to the existence of weakly contributing elements among activation outputs.
This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs toward higher activation sparsity.
arXiv Detail & Related papers (2024-02-21T03:58:49Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
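For reference, the classic 1-bit baseline that post-training binarization schemes build on represents each weight row as a sign matrix times a per-row scale, with alpha = mean(|w|) minimizing the per-row squared error. The sketch below shows only this baseline; BiLLM's salient-weight splitting and residual approximation are not shown.

```python
import torch

def binarize_rowwise(W: torch.Tensor):
    """Classic 1-bit weight binarization: W ≈ alpha * sign(W), with a per-row
    scale alpha = mean(|W|), which minimizes ||W - alpha*sign(W)||_F per row.
    (Baseline only; BiLLM adds salient-weight handling on top of this idea.)"""
    alpha = W.abs().mean(dim=1, keepdim=True)  # one scale per output row
    B = torch.sign(W)                          # 1-bit codes in {-1, 0, +1}
    return alpha, B

W = torch.randn(1024, 1024)
alpha, B = binarize_rowwise(W)
W_hat = alpha * B
print((W - W_hat).pow(2).mean())               # reconstruction error of the baseline
```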
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach for sparse large language models (LLMs).
Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the potential of sparsity to LLMs.
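As a toy illustration of the objective only (not DSnoT's actual selection criteria or efficiency), the sketch below takes one dense weight row, its pruning mask, and a few calibration inputs, and performs a single grow/prune swap when it reduces the output reconstruction error relative to the dense row.

```python
import torch

def swap_to_reduce_error(w, mask, X):
    """Brute-force single grow/prune swap for one weight row (toy illustration).
    w: (d,) dense weights; mask: (d,) bool keep-mask; X: (n, d) calibration inputs."""
    dense_out = X @ w
    def err(m):
        return (X @ (w * m) - dense_out).pow(2).sum()
    best_err, best_swap = err(mask), None
    for grow in torch.nonzero(~mask).flatten():      # currently pruned weights
        for prune in torch.nonzero(mask).flatten():  # currently kept weights
            trial = mask.clone()
            trial[grow], trial[prune] = True, False
            e = err(trial)
            if e < best_err:
                best_err, best_swap = e, (grow.item(), prune.item())
    if best_swap is not None:
        mask[best_swap[0]], mask[best_swap[1]] = True, False
    return mask, best_err

d = 16
w = torch.randn(d)
mask = torch.rand(d) > 0.5          # a random 50% mask as the starting point
X = torch.randn(32, d)
mask, e = swap_to_reduce_error(w, mask, X)
```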
arXiv Detail & Related papers (2023-10-13T07:38:52Z) - Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.