Explore Activation Sparsity in Recurrent LLMs for Energy-Efficient Neuromorphic Computing
- URL: http://arxiv.org/abs/2501.16337v1
- Date: Thu, 09 Jan 2025 19:13:03 GMT
- Title: Explore Activation Sparsity in Recurrent LLMs for Energy-Efficient Neuromorphic Computing
- Authors: Ivan Knunyants, Maryam Tavakol, Manolis Sifalakis, Yingfu Xu, Amirreza Yousefzadeh, Guangzhi Tang
- Abstract summary: Recurrent Large Language Models (R-LLM) have proven effective in mitigating the complexity of self-attention.
We propose a low-cost, training-free algorithm to sparsify R-LLMs' activations to enhance energy efficiency on neuromorphic hardware.
- Score: 3.379854610429579
- Abstract: The recent rise of Large Language Models (LLMs) has revolutionized the deep learning field. However, the desire to deploy LLMs on edge devices introduces energy efficiency and latency challenges. Recurrent LLM (R-LLM) architectures have proven effective in mitigating the quadratic complexity of self-attention, making them a potential paradigm for computing on-edge neuromorphic processors. In this work, we propose a low-cost, training-free algorithm to sparsify R-LLMs' activations to enhance energy efficiency on neuromorphic hardware. Our approach capitalizes on the inherent structure of these models, rendering them well-suited for energy-constrained environments. Although primarily designed for R-LLMs, this method can be generalized to other LLM architectures, such as transformers, as demonstrated on the OPT model, achieving comparable sparsity and efficiency improvements. Empirical studies illustrate that our method significantly reduces computational demands while maintaining competitive accuracy across multiple zero-shot learning benchmarks. Additionally, hardware simulations with the SENECA neuromorphic processor underscore notable energy savings and latency improvements. These results pave the way for low-power, real-time neuromorphic deployment of LLMs and demonstrate the feasibility of training-free on-chip adaptation using activation sparsity.
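To make the idea of training-free activation sparsification concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm, which the abstract does not spell out. It assumes simple magnitude thresholding, and every name in it (`threshold_for_target`, `sparsify`, `sparsity`, `event_driven_matvec`) is hypothetical. It also illustrates why zeros matter on event-driven hardware: a matrix-vector product only needs to visit the non-zero activations.

```python
import random

def threshold_for_target(activations, target_sparsity):
    """Pick a magnitude threshold so that roughly `target_sparsity`
    of the values fall below it (assumption: plain magnitude ranking)."""
    magnitudes = sorted(abs(a) for a in activations)
    k = int(target_sparsity * len(magnitudes))
    return magnitudes[min(k, len(magnitudes) - 1)]

def sparsify(activations, threshold):
    """Zero every activation whose magnitude is below `threshold`."""
    return [a if abs(a) >= threshold else 0.0 for a in activations]

def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)

def event_driven_matvec(weights, x):
    """Multiply `weights` (a list of rows) by `x`, visiting only the
    non-zero entries of `x`. Returns (result, multiply-accumulate count)."""
    active = [j for j, v in enumerate(x) if v != 0.0]
    out = [sum(row[j] * x[j] for j in active) for row in weights]
    return out, len(active) * len(weights)

# Toy data standing in for one layer's activations and weights.
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
W = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(8)]

t = threshold_for_target(x, 0.7)
xs = sparsify(x, t)

# Dense reference: touches every column, including the zeros.
dense = [sum(w * v for w, v in zip(row, xs)) for row in W]
sparse, macs = event_driven_matvec(W, xs)
dense_macs = len(W) * len(xs)
```

In this toy setup the event-driven product matches the dense one while performing only the multiply-accumulates for surviving activations, so `macs` is a fraction of `dense_macs`. How much accuracy a real model retains at a given threshold is an empirical question that this sketch does not address.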
Related papers
- DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.
We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.
Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
- Low-Rank Adapters Meet Neural Architecture Search for LLM Compression [1.8434042562191815]
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment.
Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models.
This paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques.
arXiv Detail & Related papers (2025-01-23T02:14:08Z)
- eFedLLM: Efficient LLM Inference Based on Federated Learning [1.6179784294541053]
Large Language Models (LLMs) herald a transformative era in artificial intelligence (AI).
This paper introduces an effective approach that enhances the operational efficiency and affordability of LLM inference.
arXiv Detail & Related papers (2024-11-24T22:50:02Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- The Potential of Combined Learning Strategies to Enhance Energy Efficiency of Spiking Neuromorphic Systems [0.0]
This manuscript focuses on enhancing brain-inspired perceptual computing machines through a novel combined learning approach for Convolutional Spiking Neural Networks (CSNNs).
CSNNs present a promising alternative to traditional power-intensive and complex machine learning methods like backpropagation, offering energy-efficient spiking neuron processing inspired by the human brain.
arXiv Detail & Related papers (2024-08-13T18:40:50Z)
- Learn To be Efficient: Build Structured Sparsity in Large Language Models [17.940183066850565]
Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads.
Existing methods only focus on utilizing this naturally formed activation sparsity in a post-training setting.
We introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs.
arXiv Detail & Related papers (2024-02-09T01:18:16Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation [91.05073136215886]
"Actor-Learner Distillation" transfers learning progress from a large capacity learner model to a small capacity actor model.
We demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model.
arXiv Detail & Related papers (2021-04-04T17:56:34Z)
- Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration [130.89746032163106]
We propose ALOE, a new algorithm for learning conditional and unconditional EBMs for discrete structured data.
We show that the energy function and sampler can be trained efficiently via a new variational form of power iteration.
We present an energy model guided fuzzer for software testing that achieves comparable performance to well engineered fuzzing engines like libfuzzer.
arXiv Detail & Related papers (2020-11-10T19:31:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.