Discovering the Gems in Early Layers: Accelerating Long-Context LLMs
with 1000x Input Token Reduction
- URL: http://arxiv.org/abs/2409.17422v1
- Date: Wed, 25 Sep 2024 23:14:47 GMT
- Title: Discovering the Gems in Early Layers: Accelerating Long-Context LLMs
with 1000x Input Token Reduction
- Authors: Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency.
Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption.
We propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing.
- Score: 47.38471103190534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in
handling long context inputs, but this comes at the cost of increased
computational resources and latency. Our research introduces a novel approach
for the long context bottleneck to accelerate LLM inference and reduce GPU
memory consumption. Our research demonstrates that LLMs can identify relevant
tokens in the early layers before generating answers to a query. Leveraging
this insight, we propose an algorithm that uses early layers of an LLM as
filters to select and compress input tokens, significantly reducing the context
length for subsequent processing. Our method, GemFilter, demonstrates
substantial improvements in both speed and memory efficiency compared to
existing techniques, such as standard attention and SnapKV/H2O. Notably, it
achieves a 2.4× speedup and a 30% reduction in GPU memory usage compared
to SOTA methods. Evaluation on the Needle in a Haystack task shows that
GemFilter significantly outperforms standard attention and SnapKV, and demonstrates
comparable performance on the LongBench challenge. GemFilter is simple,
training-free, and broadly applicable across different LLMs. Crucially, it
provides interpretability by allowing humans to inspect the selected input
sequence. These findings not only offer practical benefits for LLM deployment,
but also enhance our understanding of LLM internal mechanisms, paving the way
for further optimizations in LLM design and inference. Our code is available at
https://github.com/SalesforceAIResearch/GemFilter.
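The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each input token already has a relevance score taken from an early layer (for example, the attention weight the final query token assigns to it), and that selection is a simple top-k that preserves the original token order. The function name `select_tokens`, the toy tokens, and the scores are all hypothetical.

```python
import heapq


def select_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order.

    `scores` stands in for per-token relevance read from an early layer;
    how those scores are computed is an assumption, not taken from the paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = min(k, len(tokens))
    # Indices of the k largest scores.
    top = heapq.nlargest(k, range(len(tokens)), key=lambda i: scores[i])
    # Restore input order so the compressed context stays readable.
    return [tokens[i] for i in sorted(top)]


# Toy example: a "needle" (the code 7421) buried in unrelated text.
tokens = ["The", "sky", "is", "blue", ".", "The", "code", "is", "7421", ".",
          "What", "is", "the", "code", "?"]
scores = [0.01, 0.02, 0.01, 0.02, 0.01, 0.07, 0.20, 0.06, 0.30, 0.02,
          0.03, 0.04, 0.03, 0.15, 0.05]

compressed = select_tokens(tokens, scores, k=6)
print(compressed)  # ['The', 'code', 'is', '7421', 'code', '?']
```

In this sketch the compressed list would then be fed to the full model in place of the original input. Because the selected tokens are plain text, a person can inspect them directly, which is the interpretability property the abstract mentions.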
Related papers
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - Efficient LLM Scheduling by Learning to Rank [19.33941579312897]
We show that it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank.
We develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches.
arXiv Detail & Related papers (2024-08-28T13:35:54Z) - MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [36.49445805074941]
MInference (Milliontokens Inference) is a sparse calculation method designed to accelerate pre-filling of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
arXiv Detail & Related papers (2024-07-02T17:59:56Z) - Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models [5.0490573482829335]
Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities.
This paper investigates the use of a pre-filtering step before passage re-ranking in information retrieval (IR).
Our experiments show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task.
arXiv Detail & Related papers (2024-06-26T20:12:24Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - Optimizing LLM Queries in Relational Workloads [58.254894049950366]
We show how to optimize Large Language Model (LLM) inference for analytical workloads that invoke LLMs within relational queries.
We implement these optimizations in Apache Spark, with vLLM as the model serving backend.
We achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets.
arXiv Detail & Related papers (2024-03-09T07:01:44Z) - InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to look up token-relevant units for attention.
Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z) - Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z) - Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.