Related papers: Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study

URL: http://arxiv.org/abs/2405.09274v1
Date: Wed, 15 May 2024 11:42:42 GMT
Title: Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study
Authors: Chi Ma, Mincong Huang, Chao Wang, Yujie Wang, Lei Yu
Abstract summary: We investigate the efficacy of dynamic activation mechanisms within the LLaMA family of language models. Our empirical findings have uncovered several inherent pitfalls in the current dynamic activation schemes.
Score: 20.404448253054014
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we systematically investigate the efficacy of dynamic activation mechanisms within the LLaMA family of language models. Despite the potential of dynamic activation methods to reduce computation and increase speed in models using the ReLU activation function, our empirical findings have uncovered several inherent pitfalls in the current dynamic activation schemes. Through extensive experiments across various dynamic activation strategies, we demonstrate that LLaMA models usually underperform when compared to their ReLU counterparts, particularly in scenarios demanding high sparsity ratio. We attribute these deficiencies to a combination of factors: 1) the inherent complexity of dynamically predicting activation heads and neurons; 2) the inadequate sparsity resulting from activation functions; 3) the insufficient preservation of information resulting from KV cache skipping. Our analysis not only sheds light on the limitations of dynamic activation in the context of large-scale LLaMA models but also proposes roadmaps for enhancing the design of future sparsity schemes.
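Illustration (not from the paper): a minimal sketch of the second factor named in the abstract, the inadequate sparsity produced by the activation function itself. It assumes standard-normal pre-activations and a hypothetical magnitude threshold of 0.1; both are stand-ins chosen for illustration, not values reported by the authors. SiLU is used because it is the gate activation in LLaMA's feed-forward layers.

```python
import math
import random


def relu(x: float) -> float:
    """ReLU: negative pre-activations become exact zeros."""
    return max(0.0, x)


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x). Negative inputs shrink but stay nonzero."""
    return x / (1.0 + math.exp(-x))


def sparsity(values, threshold=0.0):
    """Fraction of activations whose magnitude is at most `threshold`."""
    return sum(1 for v in values if abs(v) <= threshold) / len(values)


# Assumption: pre-activations drawn from N(0, 1), purely for illustration.
random.seed(0)
pre_acts = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_out = [relu(x) for x in pre_acts]
silu_out = [silu(x) for x in pre_acts]

relu_exact = sparsity(relu_out)        # share of exact zeros under ReLU
silu_exact = sparsity(silu_out)        # share of exact zeros under SiLU
silu_thresh = sparsity(silu_out, 0.1)  # hypothetical threshold, not from the paper

print(f"ReLU exact-zero sparsity:      {relu_exact:.3f}")
print(f"SiLU exact-zero sparsity:      {silu_exact:.3f}")
print(f"SiLU sparsity at |a| <= 0.1:   {silu_thresh:.3f}")
```

Under these assumptions ReLU zeroes roughly half of the neurons for free, whereas SiLU yields almost no exact zeros and reaches a lower sparsity level only by discarding small but nonzero activations, which is where an approximation error enters.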

Related papers

A Refined Analysis of Massive Activations in LLMs [0.3574867616159909]
We conduct an analysis of massive activations across a broad range of large language models (LLMs). Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e. suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; and (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases.
arXiv Detail & Related papers (2025-03-28T11:08:34Z)
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity [62.09617609556697]
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated. We propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric. We show that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity.
arXiv Detail & Related papers (2024-11-04T17:59:04Z)
Reinforcement Learning under Latent Dynamics: Toward Statistical and Algorithmic Modularity [51.40558987254471]
Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations. This paper addresses the question of reinforcement learning under $\textit{general}$ latent dynamics from a statistical and algorithmic perspective.
arXiv Detail & Related papers (2024-10-23T14:22:49Z)
First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models [25.15698344467722]
This paper introduces a training-free Threshold-based Dynamic Activation method that leverages sequence information to exploit the inherent sparsity of models across various architectures. We theoretically analyze two of its critical features: history-related activation uncertainty and semantic-irrelevant activation inertia.
arXiv Detail & Related papers (2024-08-21T07:38:51Z)
MOYU: A Theoretical Study on Massive Over-activation Yielded Uplifts in LLMs [20.404448253054014]
Massive Over-activation Yielded Uplifts (MOYU) is an inherent property of large language models. Dynamic activation based on the MOYU property is a clever yet under-explored strategy designed to accelerate inference in these models.
arXiv Detail & Related papers (2024-06-18T12:57:33Z)
A Method on Searching Better Activation Functions [15.180864683908878]
We propose the Entropy-based Activation Function Optimization (EAFO) methodology for designing static activation functions in deep neural networks. We derive a novel activation function from ReLU, known as Correction Regularized ReLU (CRReLU).
arXiv Detail & Related papers (2024-05-19T03:48:05Z)
ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models [74.59731375779934]
Activation sparsity refers to the existence of weakly-contributed elements among activation outputs. This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity.
arXiv Detail & Related papers (2024-02-21T03:58:49Z)
ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs [91.31204876440765]
We introduce a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold. To find the most efficient activation function for sparse computation, we propose a systematic framework. We conduct thorough experiments on LLMs utilizing different activation functions, including ReLU, SwiGLU, ReGLU, and ReLU$^2$.
arXiv Detail & Related papers (2024-02-06T08:45:51Z)
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models [35.77063662562747]
Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. Their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer.
arXiv Detail & Related papers (2023-10-06T20:01:33Z)
Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning. We provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z)
Deep Bayesian Active Learning for Accelerating Stochastic Simulation [74.58219903138301]
Interactive Neural Process (INP) is a deep active learning framework for accelerating stochastic simulations. For active learning, we propose a novel acquisition function, Latent Information Gain (LIG), calculated in the latent space of NP-based models. The results demonstrate that STNP outperforms the baselines in the learning setting and that LIG achieves state-of-the-art performance for active learning.
arXiv Detail & Related papers (2021-06-05T01:31:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.