Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- URL: http://arxiv.org/abs/2310.17157v1
- Date: Thu, 26 Oct 2023 05:01:09 GMT
- Title: Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- Authors: Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song,
Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen
- Abstract summary: Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications.
Existing methods either require costly retraining, have to forgo LLM's in-context learning ability, or do not yield wall-clock time speedup.
We propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer.
- Score: 90.96447932006822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) with hundreds of billions of parameters have
sparked a new wave of exciting AI applications. However, they are
computationally expensive at inference time. Sparsity is a natural approach to
reduce this cost, but existing methods either require costly retraining, have
to forgo LLM's in-context learning ability, or do not yield wall-clock time
speedup on modern hardware. We hypothesize that contextual sparsity, i.e.,
small, input-dependent sets of attention heads and MLP parameters that yield
approximately the same output as the dense model for a given input, can address
these issues. We show that contextual sparsity exists, that it can be
accurately predicted, and that we can exploit it to speed up LLM inference in
wall-clock time without compromising LLM's quality or in-context learning
ability. Based on these insights, we propose DejaVu, a system that uses a
low-cost algorithm to predict contextual sparsity on the fly given inputs to
each layer, along with an asynchronous and hardware-aware implementation that
speeds up LLM inference. We validate that DejaVu can reduce the inference
latency of OPT-175B by over 2X compared to the state-of-the-art
FasterTransformer, and over 6X compared to the widely used Hugging Face
implementation, without compromising model quality. The code is available at
https://github.com/FMInference/DejaVu.
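The abstract's core mechanism is a low-cost predictor that, given the current input to a layer, selects a small subset of attention heads and MLP neurons and computes only those. The sketch below illustrates this idea for a single MLP block using a small low-rank predictor with top-k gating; the class names, dimensions, and gating rule are illustrative assumptions rather than DejaVu's actual implementation (see the linked repository for that), and the asynchronous, hardware-aware execution the abstract mentions is not shown.

```python
# Minimal sketch of DejaVu-style contextual sparsity for one MLP block.
# All names, shapes, and the top-k rule are illustrative assumptions.
import torch
import torch.nn as nn


class SparsityPredictor(nn.Module):
    """Low-cost, low-rank predictor that scores which hidden neurons of the
    large MLP block are likely to be active for the current input."""

    def __init__(self, d_model: int, d_ffn: int, rank: int = 128):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_ffn, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Higher score = neuron more likely to have a non-negligible activation.
        return self.up(torch.relu(self.down(x)))


class ContextuallySparseMLP(nn.Module):
    def __init__(self, d_model: int = 1024, d_ffn: int = 4096, top_k: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.fc2 = nn.Linear(d_ffn, d_model)
        self.predictor = SparsityPredictor(d_model, d_ffn)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (d_model,) -- one token during autoregressive decoding.
        scores = self.predictor(x)                    # (d_ffn,) neuron scores
        idx = torch.topk(scores, self.top_k).indices  # predicted active neurons
        # Compute only the selected rows/columns of the large weight matrices,
        # shrinking the d_model x d_ffn matmuls to d_model x top_k.
        h = torch.relu(x @ self.fc1.weight[idx].t() + self.fc1.bias[idx])
        return h @ self.fc2.weight[:, idx].t() + self.fc2.bias


mlp = ContextuallySparseMLP()
print(mlp(torch.randn(1024)).shape)  # torch.Size([1024])
```

In a real system the savings come from loading and multiplying only the selected weight slices on the accelerator, and the predictor is kept small (here a rank-128 bottleneck) so its overhead stays well below the cost of the dense layer it gates; an analogous predictor can select attention heads.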
Related papers
- SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity [52.88892280536302]
We introduce SparseLoRA, a method that accelerates fine-tuning through contextual sparsity.
We show that SparseLoRA reduces computational cost by up to 2.2 times, with a measured speedup of up to 1.6 times.
arXiv Detail & Related papers (2025-06-19T17:53:34Z)
- Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity [4.24164487223914]
We introduce Polar Sparsity, highlighting a key shift in sparsity importance from dense to Attention layers as we scale batch size and sequence length.
We develop hardware-efficient, sparsity-aware kernels for selective MLP and Attention computation, delivering up to 2.2x end-to-end speedups for models like OPT, LLaMA-2 & 3, across various batch sizes and sequence lengths without compromising accuracy.
arXiv Detail & Related papers (2025-05-20T20:15:42Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture that incorporates a dedicated temporal encoder between the image encoder and the Video-LLM.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Online Scheduling for LLM Inference with KV Cache Constraints [22.155429544207827]
Large Language Model (LLM) inference is an intensive process requiring efficient scheduling to optimize latency and resource utilization.
We propose novel scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory.
Our results offer a path toward more sustainable and cost-effective LLM deployment.
arXiv Detail & Related papers (2025-02-10T23:11:44Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x across several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization [8.121663525764294]
Large language models (LLMs) play a crucial role in our daily lives due to their ability to understand and generate human-like text.
In this report, we design a collaborative inference architecture between a server and its clients to alleviate the throughput limit.
We show in experiments that we can efficiently distribute the workload, allowing for a roughly 1/3 reduction in the server workload.
arXiv Detail & Related papers (2024-10-14T17:38:41Z)
- Skipping Computations in Multimodal LLMs [63.29737699997859]
This study investigates redundancy in Multimodal Large Language Models (MLLMs) during inference.
We propose different methods to skip computations, such as skipping entire blocks, FFN or self-attention layers.
Our findings validate that a significant amount of computation can be avoided at inference time.
arXiv Detail & Related papers (2024-10-12T09:21:45Z)
- SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality.
We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference.
Our experiments demonstrate that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
- ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models [67.97667465509504]
We develop a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns.
ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework.
arXiv Detail & Related papers (2024-06-24T13:41:08Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- Are Large Language Models Temporally Grounded? [38.481606493496514]
We provide large language models (LLMs) with textual narratives.
We probe them with respect to their common-sense knowledge of the structure and duration of events.
We evaluate state-of-the-art LLMs on three tasks reflecting these abilities.
arXiv Detail & Related papers (2023-11-14T18:57:15Z)
- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach for sparse large language models (LLMs).
Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
- Sparse Fine-tuning for Inference Acceleration of Large Language Models [48.285897264669984]
We consider the problem of accurate sparse fine-tuning of large language models (LLMs).
We perform a detailed study of distillation-type losses, determining an L2-based distillation approach we term SquareHead.
For MPT text generation, we show for the first time that sparse fine-tuning can reach 75% sparsity without accuracy drops.
arXiv Detail & Related papers (2023-10-10T18:28:38Z)
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses LLMs' own response length perception for sequence scheduling.
arXiv Detail & Related papers (2023-05-22T15:36:06Z)