ReLU Strikes Back: Exploiting Activation Sparsity in Large Language
Models
- URL: http://arxiv.org/abs/2310.04564v1
- Date: Fri, 6 Oct 2023 20:01:33 GMT
- Title: ReLU Strikes Back: Exploiting Activation Sparsity in Large Language
Models
- Authors: Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel
Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar
- Abstract summary: Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications.
Their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices.
We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer.
- Score: 35.77063662562747
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) with billions of parameters have drastically
transformed AI applications. However, their demanding computation during
inference has raised significant challenges for deployment on
resource-constrained devices. Despite recent trends favoring alternative
activation functions such as GELU or SiLU, known for increased computation,
this study strongly advocates for reinstating ReLU activation in LLMs. We
demonstrate that using the ReLU activation function has a negligible impact on
convergence and performance while significantly reducing computation and weight
transfer. This reduction is particularly valuable during the memory-bound
inference step, where efficiency is paramount. Exploring sparsity patterns in
ReLU-based LLMs, we unveil the reutilization of activated neurons for
generating new tokens, and, leveraging these insights, we propose practical
strategies to substantially reduce LLM inference computation by up to three
times using ReLU activations, with minimal performance trade-offs.
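To make the mechanism concrete, here is a minimal sketch, under assumed names and shapes, of a two-layer FFN block in which the down-projection is computed only for neurons whose ReLU output is non-zero; it illustrates why ReLU sparsity cuts both computation and weight transfer, and is not the paper's actual implementation.

```python
import torch

def sparse_ffn_forward(x, W_up, b_up, W_down, b_down):
    """Two-layer FFN with ReLU, skipping work for inactive neurons (illustrative)."""
    h = torch.relu(W_up @ x + b_up)          # up-projection; zero entries need no further work
    active = h.nonzero(as_tuple=True)[0]     # indices of activated neurons
    # Only the columns of W_down belonging to active neurons contribute to the output,
    # so the remaining down-projection weights need not be read or multiplied.
    y = W_down[:, active] @ h[active] + b_down
    return y, active

# Toy sizes; in an LLM the FFN width is typically several times the model width.
torch.manual_seed(0)
d_model, d_ff = 8, 32
W_up, b_up = torch.randn(d_ff, d_model), torch.zeros(d_ff)
W_down, b_down = torch.randn(d_model, d_ff), torch.zeros(d_model)
y, active = sparse_ffn_forward(torch.randn(d_model), W_up, b_up, W_down, b_down)
```

Since the abstract reports that consecutive tokens largely reuse the same activated neurons, the `active` set from the previous token can in principle guide which weight rows and columns to keep in fast memory for the next token; the prediction and caching policy is left abstract here.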
Related papers
- R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.
Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
- Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models [53.4530106173067]
Large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks.
RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively.
This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge.
arXiv Detail & Related papers (2025-04-03T04:46:17Z)
- LoRS: Efficient Low-Rank Adaptation for Sparse Large Language Model [21.98687961440789]
Existing low-rank adaptation (LoRA) methods face challenges on sparse large language models (LLMs) due to the inability to maintain sparsity.
Recent works introduced methods that maintain sparsity by augmenting LoRA techniques with additional masking mechanisms (a minimal sketch of this masking idea appears after this list).
We introduce LoRS, an innovative method designed to achieve both memory and computation efficiency when fine-tuning sparse LLMs.
arXiv Detail & Related papers (2025-01-15T05:07:06Z)
- Explore Activation Sparsity in Recurrent LLMs for Energy-Efficient Neuromorphic Computing [3.379854610429579]
Recurrent Large Language Models (R-LLM) have proven effective in mitigating the complexity of self-attention.
We propose a low-cost, training-free algorithm to sparsify R-LLMs' activations to enhance energy efficiency on neuromorphic hardware.
arXiv Detail & Related papers (2025-01-09T19:13:03Z)
- Hysteresis Activation Function for Efficient Inference [3.5223695602582614]
We propose a Hysteresis Rectified Linear Unit (HeLU) to address the "dying ReLU" problem with minimal complexity.
Unlike traditional activation functions with fixed thresholds for training and inference, HeLU employs a variable threshold that refines the backpropagation.
arXiv Detail & Related papers (2024-11-15T20:46:58Z)
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of many weakly contributing elements within activation outputs that can be eliminated.
We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
- CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification [7.8430836312711465]
Large language models (LLMs) on edge devices present significant challenges due to the substantial computational overhead and memory requirements.
Activation sparsification can mitigate these challenges by reducing the number of activated neurons during inference.
This paper introduces CHESS, a general activation sparsification approach via CHannel-wise thrEsholding and Selective Sparsification.
arXiv Detail & Related papers (2024-09-02T16:41:44Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study [20.404448253054014]
We investigate the efficacy of dynamic activation mechanisms within the LLaMA family of language models.
Our empirical findings have uncovered several inherent pitfalls in the current dynamic activation schemes.
arXiv Detail & Related papers (2024-05-15T11:42:42Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [74.59731375779934]
Activation sparsity refers to the existence of weakly contributing elements among activation outputs.
This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity.
arXiv Detail & Related papers (2024-02-21T03:58:49Z)
- ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs [91.31204876440765]
We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold.
To find the most efficient activation function for sparse computation, we propose a systematic framework.
We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$.
arXiv Detail & Related papers (2024-02-06T08:45:51Z)
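As referenced in the ReLU$^2$ Wins and CHESS entries above, several of these works declare a neuron active only when its output magnitude exceeds a threshold, applied either globally or per channel. The sketch below shows that masking step under assumed tensor names and an assumed quantile-based calibration rule; it is not either paper's actual procedure.

```python
import torch

def magnitude_sparsify(h: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    """Zero activations whose magnitude is at or below the threshold, so that the
    weights they feed can be skipped downstream (illustrative)."""
    return h * (h.abs() > threshold)

h = torch.randn(64, 1024)                     # [tokens, d_ff] hidden activations

# A single global magnitude threshold ...
h_sparse = magnitude_sparsify(h, torch.tensor(0.5))

# ... or a channel-wise threshold, here calibrated so that roughly 70% of the
# activations in each channel are zeroed (an assumed calibration rule).
channel_thr = h.abs().quantile(0.7, dim=0)    # shape [d_ff]
h_sparse_cw = magnitude_sparsify(h, channel_thr)
print((h_sparse_cw == 0).float().mean())      # achieved sparsity, roughly 0.7
```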
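The LoRS entry above notes that earlier approaches keep a sparse LLM sparse during LoRA fine-tuning by masking the adapter update; here is a minimal sketch of that masking idea applied when merging the adapter into a pruned weight. The names and the merge-time masking are assumptions for illustration, not LoRS's algorithm, and its memory and computation savings are not modeled here.

```python
import torch

def merge_lora_preserving_sparsity(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Merge a low-rank update B @ A into a sparse base weight W while keeping
    W's zero pattern, so the merged weight stays sparse (illustrative)."""
    mask = (W != 0)                  # zero pattern of the pruned base weight
    return (W + B @ A) * mask        # masked update: pruned entries remain zero

d_out, d_in, r = 16, 32, 4
W = torch.randn(d_out, d_in) * (torch.rand(d_out, d_in) > 0.5)   # ~50%-sparse base weight
A, B = torch.randn(r, d_in), torch.randn(d_out, r)
W_merged = merge_lora_preserving_sparsity(W, A, B)
print((W_merged == 0).float().mean())                            # sparsity preserved (~0.5)
```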
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.