Related papers: BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching

URL: http://arxiv.org/abs/2410.18701v1
Date: Thu, 24 Oct 2024 12:53:39 GMT
Title: BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching
Authors: Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang
Abstract summary: We propose BATON, an efficient batch-wise LLM inference scheme that dynamically adjusts the processing batch. Compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
Score: 4.610983384440473
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The advanced capabilities of Large Language Models (LLMs) have inspired the development of various interactive web services and applications, such as ChatGPT, which offer query inference services for users. Unlike traditional DNN models, LLM inference entails a different number of forward-computation iterations for different queries, which creates efficiency challenges for existing run-to-completion batch-wise inference. Hence, some methods refine batch-wise inference to the iteration level by duplicating all nonlinear layers of the LLM. However, this approach not only increases resource usage but also introduces idle computations to the batch due to the prefilling of newly added queries. Therefore, we propose BATON, an efficient batch-wise LLM inference scheme that dynamically adjusts the processing batch and can achieve near-zero idle computations without incurring additional resource consumption. To do so, BATON 1) shapes the vectors involved in the inference of the newly inserted query and the processing batch to align dimensions, and generates a new attention mask based on vector shaping to ensure inference correctness, which enables query insertion without consuming additional resources; 2) embeds the prefilled Keys and Values of the new query into the KV_Cache of the processing batch by leveraging the prefilling and decoding separation mechanism, eliminating the idle computations that the prefilling of the new query would otherwise introduce to the batch. Experimental results show that compared to the state-of-the-art solution Orca, BATON improves query processing by up to 1.75 times.
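
The dimension alignment and attention mask described in step 1) can be illustrated with a toy sketch. This is not BATON's implementation: the function name `align_and_mask`, the use of left-padding, and the string placeholders standing in for Key/Value tensors are all assumptions made for illustration. It only shows how per-query caches of different lengths might be padded to a common width, with a 0/1 mask marking which slots are real.

```python
from typing import List, Tuple

PAD = None  # hypothetical placeholder for a padded (masked-out) cache slot


def align_and_mask(batch_kv: List[list], new_kv: list) -> Tuple[List[list], List[List[int]]]:
    """Pad every per-query cache to a common length and build a 0/1 mask.

    batch_kv: cached entries for each query already in the processing batch
    new_kv:   prefilled entries of the query being inserted
    Left-padding is an assumption of this sketch, not taken from the paper.
    """
    rows = batch_kv + [new_kv]
    width = max(len(r) for r in rows)
    # Pad on the left so the most recent entry of every query sits in the last column.
    padded = [[PAD] * (width - len(r)) + list(r) for r in rows]
    # 1 marks a real cache entry, 0 marks padding that attention must ignore.
    mask = [[0] * (width - len(r)) + [1] * len(r) for r in rows]
    return padded, mask


# Two queries are mid-decoding; a third, already prefilled, is inserted.
running = [["a1", "a2", "a3"], ["b1", "b2"]]
padded, mask = align_and_mask(running, ["c1", "c2", "c3", "c4", "c5"])
for row in mask:
    print(row)
```

In a real system the rows would be Key/Value tensors and the mask would be applied inside the attention computation; the sketch covers only the bookkeeping.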

Related papers

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding [76.23719557942917]
TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens.
arXiv Detail & Related papers (2025-02-21T04:19:24Z)
Multi-Bin Batching for Increasing LLM Inference Throughput [19.652542432683234]
As large language models (LLMs) grow in popularity, improving the efficiency of their inference systems becomes increasingly important. Batching requests is a critical step in scheduling jobs on servers. Requests often have varying generation lengths, causing resource underutilization. We formalize this problem from a queueing-theoretic perspective, and aim to design a throughput control policy.
arXiv Detail & Related papers (2024-12-03T03:16:12Z)
Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models [93.5327725085853]
Continual LLaVA is a rehearsal-free method tailored for continual instruction tuning in LVLMs. Experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing the forgetting during the continual instruction tuning process.
arXiv Detail & Related papers (2024-11-04T19:55:32Z)
Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning [19.16587730306472]
In-Context Learning (ICL) emerges as a key feature for Large Language Models (LLMs). We propose Logit Arithmetic Reweighting Approach (LARA), a novel framework that enhances ICL by using logit-based ensembling of multiple demonstrations.
arXiv Detail & Related papers (2024-10-14T01:34:16Z)
COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
Auto-Demo Prompting: Leveraging Generated Outputs as Demonstrations for Enhanced Batch Prompting [0.8238423959893132]
"Auto-Demo Prompting" is a novel approach that leverages the question-output pairs from earlier questions within a batch as demonstrations for subsequent answer inference. Our method effectively bridges the gap between batch prompting and few-shot prompting, enhancing performance with only a slight compromise in token usage.
arXiv Detail & Related papers (2024-10-02T16:34:40Z)
QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications. FactorLLM achieves performance comparable to the source model, securing up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
Efficient Prompt Caching via Embedding Similarity [26.456212783693545]
We focus on the prediction accuracy of prompt caching for single-round question-answering tasks via embedding similarity. We propose a distillation-based method to fine-tune the existing embeddings for better prediction. We also conduct simulations demonstrating that our trained models achieve better caching efficiency than the previous embedding model.
arXiv Detail & Related papers (2024-02-02T06:34:11Z)
Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions. We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
BatchGFN: Generative Flow Networks for Batch Active Learning [80.73649229919454]
BatchGFN is a novel approach for pool-based active learning that uses generative flow networks to sample sets of data points proportional to a batch reward. We show that, in toy regression problems, our approach enables principled sampling of near-optimal utility batches at inference time with a single forward pass per point in the batch.
arXiv Detail & Related papers (2023-06-26T20:41:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.