ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse
LLMs
- URL: http://arxiv.org/abs/2402.03804v1
- Date: Tue, 6 Feb 2024 08:45:51 GMT
- Title: ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse
LLMs
- Authors: Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun
Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, Maosong Sun
- Abstract summary: We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold.
To find the most efficient activation function for sparse computation, we propose a systematic framework.
We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$.
- Score: 91.31204876440765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse computation offers a compelling solution for the inference of Large
Language Models (LLMs) in low-resource scenarios by dynamically skipping the
computation of inactive neurons. While traditional approaches focus on
ReLU-based LLMs, leveraging zeros in activation values, we broaden the scope of
sparse LLMs beyond zero activation values. We introduce a general method that
defines neuron activation through neuron output magnitudes and a tailored
magnitude threshold, demonstrating that non-ReLU LLMs also exhibit sparse
activation. To find the most efficient activation function for sparse
computation, we propose a systematic framework to examine the sparsity of LLMs
from three aspects: the trade-off between sparsity and performance, the
predictivity of sparsity, and the hardware affinity. We conduct thorough
experiments on LLMs utilizing different activation functions, including ReLU,
SwiGLU, ReGLU, and ReLU$^2$. The results indicate that models employing
ReLU$^2$ excel across all three evaluation aspects, highlighting its potential
as an efficient activation function for sparse LLMs. We will release the code
to facilitate future research.
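A minimal sketch (not the authors' released code) of the two ingredients described in the abstract: the activation functions compared, and the magnitude-threshold rule for deciding which neurons are inactive. The threshold value and tensor shapes are illustrative assumptions.

```python
# A minimal sketch, not the authors' released code: the activation functions
# compared in the paper and the magnitude-threshold rule for marking neurons
# as inactive. The threshold value and tensor shapes are illustrative.
import torch
import torch.nn.functional as F

def relu2(x):
    """ReLU^2: squared ReLU; negative inputs still map to exact zeros."""
    return F.relu(x) ** 2

def reglu(gate, up):
    """ReGLU: ReLU-gated linear unit (elementwise gate * up branch)."""
    return F.relu(gate) * up

def swiglu(gate, up):
    """SwiGLU: SiLU-gated linear unit; outputs are rarely exactly zero."""
    return F.silu(gate) * up

def inactive_mask(neuron_outputs, threshold):
    """A neuron is treated as inactive when its output magnitude falls
    below the threshold, so its downstream computation can be skipped."""
    return neuron_outputs.abs() < threshold

# Illustration on one token's FFN hidden state (d_ff = 8 is arbitrary).
h = torch.randn(8)
act = relu2(h)
mask = inactive_mask(act, threshold=1e-3)   # threshold value is an assumption
print(f"fraction of inactive neurons: {mask.float().mean():.2f}")
```

For ReLU and ReLU$^2$ many outputs are exactly zero, while the magnitude threshold is what extends the same notion of activation sparsity to SwiGLU and ReGLU.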
Related papers
- Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly contributing elements within activation outputs that can be eliminated.
We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric.
We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
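The summary above does not define the metric precisely; the following is a hedged sketch of one plausible reading, where a magnitude threshold is binary-searched so that perplexity stays within $p\%$ of the dense baseline. The helpers `eval_ppl` and `eval_sparsity` are hypothetical.

```python
# A hedged sketch of one plausible reading of a performance-aware sparsity
# metric like PPL-p%: find the largest magnitude threshold whose perplexity
# stays within p% of the dense model, and report the sparsity reached there.
# `eval_ppl(threshold)` and `eval_sparsity(threshold)` are hypothetical helpers.

def ppl_p_sparsity(eval_ppl, eval_sparsity, dense_ppl, p=0.01,
                   lo=0.0, hi=1.0, iters=20):
    """Binary-search the threshold under a (1 + p) * dense_ppl budget."""
    best = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if eval_ppl(mid) <= (1 + p) * dense_ppl:
            best = eval_sparsity(mid)   # within budget: try a larger threshold
            lo = mid
        else:
            hi = mid                    # too much degradation: back off
    return best
```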
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
- Achieving Sparse Activation in Small Language Models [9.05326883263473]
Sparse activation is a useful technique to reduce the computing cost of Large Language Models (LLMs) without retraining or adaptation efforts.
In this paper, we aim to achieve sparse activation in Small Language Models (SLMs).
We first show that the existing sparse activation schemes in LLMs that build on neurons' output magnitudes cannot be applied to SLMs, and activating neurons based on their attribution scores is a better alternative.
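As a rough illustration of attribution-based neuron selection, the sketch below uses the common gradient-times-activation score; the cited paper's exact attribution definition and any correction terms may differ.

```python
# A rough illustration of attribution-based neuron selection, assuming the
# common gradient-times-activation score; the cited paper's exact attribution
# definition may differ.
import torch

def attribution_scores(hidden, loss):
    """|activation * d(loss)/d(activation)| per neuron."""
    grads, = torch.autograd.grad(loss, hidden, retain_graph=True)
    return (hidden * grads).abs()

def top_k_active(scores, k):
    """Keep only the k highest-attribution neurons."""
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.topk(k).indices] = True
    return mask

# Toy usage with a stand-in hidden state and loss.
hidden = torch.randn(16, requires_grad=True)
loss = (hidden ** 2).sum()
active = top_k_active(attribution_scores(hidden, loss), k=4)
```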
arXiv Detail & Related papers (2024-06-03T03:21:49Z)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate and also significantly reduces the number of agent interaction steps.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
- Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
arXiv Detail & Related papers (2024-03-08T19:16:29Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard for sequential decision-making problems, improving future policies from feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
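The summary only states that LLM guidance enters as a regularization factor in value-based RL; the sketch below shows a generic KL-regularized Bellman backup toward an LLM-suggested action prior, which is an assumption about the form, not LINVIT's exact objective.

```python
# A hedged sketch of using an LLM policy prior as a regularizer in value-based
# RL: the greedy Bellman backup is replaced by a KL-regularized (soft) backup
# toward an LLM-suggested action distribution. The coefficient `lam` and the
# exact form are assumptions, not LINVIT's objective.
import numpy as np

def kl_regularized_backup(q_values, llm_prior, lam=1.0):
    """V(s) = lam * log sum_a pi_LLM(a|s) * exp(Q(s, a) / lam)."""
    return lam * np.log(np.sum(llm_prior * np.exp(q_values / lam)))

q = np.array([1.0, 0.5, -0.2])        # Q-estimates for three actions
prior = np.array([0.6, 0.3, 0.1])     # hypothetical LLM-suggested distribution
print(kl_regularized_backup(q, prior, lam=0.5))
```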
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [74.59731375779934]
Activation sparsity refers to the existence of weakly contributing elements among activation outputs.
This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity.
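The summary does not spell out the sparsification recipe; below is a hedged sketch of one standard way to push activation sparsity during training, an L1 penalty on FFN activations with a ramped-up coefficient. The linear schedule and coefficient are illustrative assumptions, not necessarily ProSparse's method.

```python
# A hedged sketch of one standard way to push activation sparsity during
# training: an L1 penalty on FFN activations whose coefficient ramps up over
# training. The linear schedule and coefficient are illustrative assumptions,
# not necessarily ProSparse's recipe.
import torch

def sparsity_regularized_loss(lm_loss, ffn_activations, step, total_steps,
                              max_coeff=1e-4):
    coeff = max_coeff * min(1.0, step / total_steps)      # linear warm-up
    l1 = sum(a.abs().mean() for a in ffn_activations)     # encourage zeros
    return lm_loss + coeff * l1

# Toy usage with a dummy language-modeling loss and one activation tensor.
loss = sparsity_regularized_loss(torch.tensor(2.3), [torch.randn(4, 64)],
                                 step=100, total_steps=1000)
```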
arXiv Detail & Related papers (2024-02-21T03:58:49Z)
- Learn To be Efficient: Build Structured Sparsity in Large Language Models [17.940183066850565]
Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads.
Existing methods focus only on utilizing naturally formed activation sparsity in a post-training setting.
We introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs.
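As a generic illustration of training-time structured sparsity (not LTE's actual algorithm), the sketch below adds a learned gate that scores groups of FFN neurons, with the mean gate value serving as a penalty that favors activating fewer groups; the group size and gating form are assumptions.

```python
# A generic illustration of training-time structured sparsity (not LTE's
# actual algorithm): a learned gate scores groups of FFN neurons, the hidden
# activations are modulated by the expanded group scores, and the mean gate
# value serves as a penalty that favors activating fewer groups.
import torch
import torch.nn as nn

class GatedFFN(nn.Module):
    def __init__(self, d_model=64, d_ff=256, group=32):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.gate = nn.Linear(d_model, d_ff // group)   # one score per neuron group
        self.group = group

    def forward(self, x):
        scores = torch.sigmoid(self.gate(x))                  # (..., n_groups)
        mask = scores.repeat_interleave(self.group, dim=-1)   # expand to d_ff
        h = torch.relu(self.up(x)) * mask                     # soft structured gating
        return self.down(h), scores.mean()                    # output, sparsity penalty

y, penalty = GatedFFN()(torch.randn(2, 64))
```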
arXiv Detail & Related papers (2024-02-09T01:18:16Z)
- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models [35.77063662562747]
Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications.
Their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices.
We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer.
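A minimal sketch of why exact zeros reduce computation and weight transfer: for a ReLU FFN, only the down-projection rows of active (nonzero) neurons need to be read and multiplied. Shapes are illustrative; real systems implement this with fused sparse kernels.

```python
# A minimal sketch of why exact zeros help at inference: for a ReLU FFN, rows
# of the down-projection that correspond to zero activations can be skipped.
# Shapes are illustrative; real systems fuse this into sparse GEMM kernels.
import torch

def sparse_ffn_forward(x, w_up, w_down):
    h = torch.relu(x @ w_up)                 # (d_ff,) hidden activations
    active = h.nonzero(as_tuple=True)[0]     # indices of nonzero neurons
    # Only the active rows of w_down contribute to the output.
    return h[active] @ w_down[active]

x = torch.randn(16)
w_up, w_down = torch.randn(16, 64), torch.randn(64, 16)
dense = torch.relu(x @ w_up) @ w_down
assert torch.allclose(sparse_ffn_forward(x, w_up, w_down), dense, atol=1e-5)
```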
arXiv Detail & Related papers (2023-10-06T20:01:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.