BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching
- URL: http://arxiv.org/abs/2410.18701v1
- Date: Thu, 24 Oct 2024 12:53:39 GMT
- Title: BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching
- Authors: Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang,
- Abstract summary: We propose BATON, an efficient batch-wise LLM inference scheme by dynamically adjusting processing batch.
Compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
- Score: 4.610983384440473
- License:
- Abstract: The advanced capabilities of Large Language Models (LLMs) have inspired the development of various interactive web services or applications, such as ChatGPT, which offer query inference services for users. Unlike traditional DNN model, the inference of LLM entails different iterations of forward computation for different queries, which result in efficiency challenges for existing run-to-completion batch-wise inference. Hence, some methods refine batch-wise inference to iteration-level by duplicating all nonlinear layers of LLM. However, this approach not only increases resource usage but also introduces idle computations to the batch due to the prefilling of newly added queries. Therefore, we propose BATON, an efficient batch-wise LLM inference scheme by dynamically adjusting processing batch, which can achieve near-zero idle computations without incurring additional resource consumption. To do so, BATON 1) shapes the vectors involved in the inference of the newly inserted query and processing batch to align dimensions and generates a new attention mask based on vector shaping to ensure inference correctness, which enables query inserting without consuming additional resource; 2) embeds prefilled Keys and Values of the new query into the KV_Cache of the processing batch by leveraging the prefilling and decoding separation mechanism, eliminating idle computations to the batch introduced by the prefilling process of the new query. Experimental results show that compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
Related papers
- Multi-Bin Batching for Increasing LLM Inference Throughput [19.652542432683234]
Large language models (LL) grow in popularity improving the efficiency of their systems.
requests is a critical step in scheduling jobs on servers.
requests often have varying generation lengths, causing resource underutilization.
We formalize this problem from a queueing-theoretic perspective, and aim to design a throughput control policy.
arXiv Detail & Related papers (2024-12-03T03:16:12Z) - Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models [93.5327725085853]
Continual LLaVA is a rehearsal-free method tailored for continual instruction tuning in LVLMs.
Experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing the forgetting during the continual instruction tuning process.
arXiv Detail & Related papers (2024-11-04T19:55:32Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - Auto-Demo Prompting: Leveraging Generated Outputs as Demonstrations for Enhanced Batch Prompting [0.8238423959893132]
"Auto-Demo Prompting" is a novel approach that leverages the question-output pairs from earlier questions within a batch as demonstrations for subsequent answer inference.
Our method effectively bridges the gap between batch prompting and few-shot prompting, enhancing performance with only a slight compromise in token usage.
arXiv Detail & Related papers (2024-10-02T16:34:40Z) - QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries.
We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.
Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - Efficient Prompt Caching via Embedding Similarity [26.456212783693545]
We focus on the prediction accuracy of prompt caching for single-round question-answering tasks via embedding similarity.
We propose a distillation-based method to fine-tune the existing embeddings for better better prediction.
We also conduct simulations demonstrating that our trained models achieve better caching efficiency than the previous embedding model.
arXiv Detail & Related papers (2024-02-02T06:34:11Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - BatchGFN: Generative Flow Networks for Batch Active Learning [80.73649229919454]
BatchGFN is a novel approach for pool-based active learning that uses generative flow networks to sample sets of data points proportional to a batch reward.
We show our approach enables principled sampling near-optimal utility batches at inference time with a single forward pass per point in the batch in toy regression problems.
arXiv Detail & Related papers (2023-06-26T20:41:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.