SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
- URL: http://arxiv.org/abs/2407.04752v1
- Date: Fri, 5 Jul 2024 08:37:17 GMT
- Title: SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
- Authors: Xingrun Xing, Boyan Gao, Zheng Zhang, David A. Clifton, Shitao Xiao, Li Du, Guoqi Li, Jiajun Zhang,
- Abstract summary: Recent large language models (LLMs) with billions of parameters have boosted their performance across various real-world applications.
Human brains exhibit significantly greater energy efficiency compared to LLMs with a similar number of parameters.
We propose the first spiking large language model as recent LLMs termed SpikeLLM.
- Score: 43.275370104552344
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The recent advancements in large language models (LLMs) with billions of parameters have significantly boosted their performance across various real-world applications. However, the inference processes for these models require substantial energy and computational resources, presenting considerable deployment challenges. In contrast, human brains, which contain approximately 86 billion biological neurons, exhibit significantly greater energy efficiency compared to LLMs with a similar number of parameters. Inspired by this, we redesign 7 to 70 billion parameter LLMs using bio-plausible spiking mechanisms, emulating the efficient behavior of the human brain. We propose the first spiking large language model as recent LLMs termed SpikeLLM. Coupled with the proposed model, a novel spike-driven quantization framework named Optimal Brain Spiking is introduced to reduce the energy cost and accelerate inference speed via two essential approaches: first (second)-order differentiation-based salient channel detection, and per-channel salient outlier expansion with Generalized Integrate-and-Fire neurons. Our proposed spike-driven quantization can plug in main streams of quantization training methods. In the OmniQuant pipeline, SpikeLLM significantly reduces 25.51% WikiText2 perplexity and improves 3.08% average accuracy of 6 zero-shot datasets on a LLAMA2-7B 4A4W model. In the GPTQ pipeline, SpikeLLM realizes a sparse ternary quantization, which achieves additive in all linear layers. Compared with PB-LLM with similar operations, SpikeLLM also exceeds significantly. We will release our code on GitHub.
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization [13.622268474310918]
ShiftAddLLM is an efficient multiplication-free model for large language models.
It achieves perplexity improvements of 5.6 and 22.7 points at comparable or lower latency.
Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM.
arXiv Detail & Related papers (2024-06-10T02:47:55Z) - SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms [30.825695629006628]
Bio-inspired spiking neural networks (SNNs) have advantages of biological plausibility, event-driven sparsity, and binary activation.
Large-scale language models exhibit promising generalization capability, making it a valuable issue to explore more general spike-driven models.
This work proposes the first fully spiking mechanism for general language tasks, including both discriminative and generative ones.
arXiv Detail & Related papers (2024-06-05T13:59:03Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs)
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference speed.
arXiv Detail & Related papers (2024-03-22T14:20:34Z) - BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z) - CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks [1.5199992713356987]
This paper introduces CompactifAI, an innovative compression approach using quantum-inspired networks.
Our method is versatile and can be implemented with - or on top of - other compression techniques.
As a benchmark, we demonstrate that a combination of CompactifAI with quantization allows to reduce a 93% memory size of LlaMA 7B.
arXiv Detail & Related papers (2024-01-25T11:45:21Z) - RPTQ: Reorder-based Post-training Quantization for Large Language Models [46.03754730678076]
Large-scale language models (LLMs) have demonstrated impressive performance, but their deployment presents challenges due to their significant memory usage.
We introduce a quantization method called RPTQ, which utilizes a reorder-based approach.
In our experiments, RPTQ achieved a significant breakthrough by utilizing 3-bit activation in LLMs for the first time, resulting in a substantial reduction in memory usage.
arXiv Detail & Related papers (2023-04-03T15:46:15Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore to accelerate large-model inference by conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.