An Efficient Inference Framework for Early-exit Large Language Models
- URL: http://arxiv.org/abs/2407.20272v1
- Date: Thu, 25 Jul 2024 07:50:17 GMT
- Title: An Efficient Inference Framework for Early-exit Large Language Models
- Authors: Ruijie Miao, Yihan Yan, Xinshuo Yao, Tong Yang
- Abstract summary: Early-exit models improve the inference efficiency of LLMs by skipping the remaining layers and directly generating output tokens once they are confident enough.
No existing LLM inference framework takes early-exit models into consideration.
We solve two key challenges in building an efficient inference framework for early-exit models: (1) batch inference at iteration-level granularity; and (2) KV cache management.
- Score: 5.048467183620882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building efficient inference frameworks has gained increasing interest in the research community. Early-exit models, a variant of LLMs, improve inference efficiency by skipping the remaining layers and directly generating output tokens once they are confident enough. However, no existing LLM inference framework takes early-exit models into consideration. This is non-trivial, as prior art on LLM inference cannot be directly applied to early-exit models. In this work, we solve two key challenges in building an efficient inference framework for early-exit models: (1) batch inference at iteration-level granularity; and (2) KV cache management. For the former, we propose to process the batch until all sequences surpass the early-exit confidence threshold. For the latter, we propose to fill the KV cache of the remaining layers before the iteration terminates. Our evaluation shows that, compared with the original vLLM operating at full layers, our solution achieves a speedup of up to 1.25x.
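The two proposed techniques are concrete enough to illustrate with a toy control-flow sketch. Below is a minimal, framework-agnostic Python sketch, not the paper's or vLLM's actual API: the `Sequence` and `decode_iteration` names, the per-layer confidence formula, and the placeholder KV values are all hypothetical, and a real implementation would operate on batched tensors with paged KV blocks.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Sequence:
    """One decoding sequence in the batch (hypothetical toy structure)."""
    tokens: list[int]
    exited_at_layer: int | None = None          # layer at which the early exit fired
    kv_cache: dict[int, list[float]] = field(default_factory=dict)  # layer index -> toy KV entry


def layer_forward(layer_idx: int, seq: Sequence) -> float:
    """Stand-in for one transformer layer: writes this layer's KV entry and
    returns a toy exit-head confidence (an assumption, not the paper's formula)."""
    seq.kv_cache[layer_idx] = [float(layer_idx)]
    return min(1.0, 0.2 * (layer_idx + 1) + 0.01 * len(seq.tokens))


def decode_iteration(batch: list[Sequence], num_layers: int, threshold: float) -> None:
    """One iteration-level decoding step for the whole batch."""
    # (1) Batch inference at iteration-level granularity: keep running layers
    # until every sequence in the batch has surpassed the exit threshold.
    for layer_idx in range(num_layers):
        active = [s for s in batch if s.exited_at_layer is None]
        if not active:
            break                                # the whole batch exited early
        for seq in active:
            if layer_forward(layer_idx, seq) >= threshold:
                seq.exited_at_layer = layer_idx  # emit this iteration's token from the exit head

    # (2) KV cache management: before the iteration terminates, fill the KV
    # cache of the remaining (skipped) layers so future iterations can attend
    # over a complete cache. The zero placeholder is purely illustrative.
    for seq in batch:
        exit_layer = seq.exited_at_layer if seq.exited_at_layer is not None else num_layers - 1
        for layer_idx in range(exit_layer + 1, num_layers):
            seq.kv_cache[layer_idx] = [0.0]
        seq.tokens.append(len(seq.tokens))       # toy "generated token"
        seq.exited_at_layer = None               # reset for the next iteration


if __name__ == "__main__":
    batch = [Sequence(tokens=[1, 2]), Sequence(tokens=[3])]
    decode_iteration(batch, num_layers=8, threshold=0.7)
    for seq in batch:
        print(seq.tokens, sorted(seq.kv_cache))
```

The two points mirrored from the abstract are that a decoding iteration only ends once every sequence in the batch has surpassed its exit threshold, and that the KV cache of the skipped layers is filled before the iteration terminates so later tokens can still attend over a complete cache.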
Related papers
- An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leverages the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z) - Dynamic Vocabulary Pruning in Early-Exit LLMs [0.11983702508388193]
Increasing the size of large language models (LLMs) has been shown to lead to better performance.
This comes at the cost of slower and more expensive inference.
We propose dynamically pruning the vocabulary at test time for each token.
arXiv Detail & Related papers (2024-10-24T17:52:31Z) - AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner.
Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
arXiv Detail & Related papers (2024-10-14T03:35:11Z) - VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks.
We identify and characterise the important components needed for effective model convergence using gradient descent.
This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z) - Not All Layers of LLMs Are Necessary During Inference [68.88671495401483]
We show that for some tasks, Large Language Models can achieve results comparable to the final output at some intermediate layers.
We propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance.
arXiv Detail & Related papers (2024-03-04T16:23:58Z) - Efficient Prompt Caching via Embedding Similarity [26.456212783693545]
We focus on the prediction accuracy of prompt caching for single-round question-answering tasks via embedding similarity.
We propose a distillation-based method to fine-tune the existing embeddings for better prediction.
We also conduct simulations demonstrating that our trained models achieve better caching efficiency than the previous embedding model.
arXiv Detail & Related papers (2024-02-02T06:34:11Z) - EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism [70.07661254213181]
We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs).
Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting.
Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead.
arXiv Detail & Related papers (2023-12-08T09:31:50Z) - Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)