The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving
- URL: http://arxiv.org/abs/2411.07447v2
- Date: Tue, 19 Nov 2024 21:57:16 GMT
- Title: The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving
- Authors: Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki
- Abstract summary: INFERMAX is an analytical framework that uses inference cost models to compare various schedulers.
Our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemption altogether.
- Score: 8.552242818726347
- Abstract: The growing usage of Large Language Models (LLMs) highlights the demands and challenges in scalable LLM inference systems, affecting deployment and development processes. On the deployment side, there is a lack of comprehensive analysis on the conditions under which a particular scheduler performs better or worse, with performance varying substantially across different schedulers, hardware, models, and workloads. Manually testing each configuration on GPUs can be prohibitively expensive. On the development side, unpredictable performance and unknown upper limits can lead to inconclusive trial-and-error processes, consuming resources on ideas that end up ineffective. To address these challenges, we introduce INFERMAX, an analytical framework that uses inference cost models to compare various schedulers, including an optimal scheduler formulated as a constraint satisfaction problem (CSP) to establish an upper bound on performance. Our framework offers in-depth analysis and raises essential questions, challenging assumptions and exploring opportunities for more efficient scheduling. Notably, our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemption altogether. We believe our methods and insights will facilitate the cost-effective deployment and development of scalable, efficient inference systems and pave the way for cost-based scheduling.
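As a rough illustration of the cost-model idea (a minimal sketch, not the authors' code: the request model, the KV-cache budget, and the shortest-remaining-first preemption policy below are all simplifying assumptions), the following toy simulator compares a non-preemptive scheduler against a preemptive one on the same workload without touching a GPU:

```python
# Toy analytical comparison of two schedulers, loosely inspired by the
# paper's cost-model approach; every modelling choice here is illustrative.
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int    # tick at which the request arrives
    remaining: int  # decode steps still needed
    cache: int      # KV-cache slots occupied while scheduled

def simulate(requests, budget, preemptive):
    """Return total ticks until all requests finish (a proxy for GPU cost)."""
    pending = sorted(requests, key=lambda r: r.arrival)
    running, tick = [], 0
    while pending or running:
        # Admit requests that have arrived by the current tick.
        while pending and pending[0].arrival <= tick:
            running.append(pending.pop(0))
        if preemptive:
            # Shortest-remaining-first: long requests may be pushed out of
            # the batch, modelling KV-cache eviction and later resumption.
            running.sort(key=lambda r: r.remaining)
        used = 0
        for r in running:               # greedily fill the cache budget
            if used + r.cache <= budget:
                r.remaining -= 1
                used += r.cache
        running = [r for r in running if r.remaining > 0]
        tick += 1
    return tick
```

Sweeping arrival patterns, request lengths, and cache budgets in such a model is cheap, which is the point: configurations where preemption pays off can be located analytically before committing GPU hours. The paper's actual framework goes further and solves for an optimal schedule as a CSP to bound what any scheduler could achieve.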
Related papers
- On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability [59.72892401927283]
We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks.
Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
arXiv Detail & Related papers (2024-09-30T03:58:43Z) - Unlocking Large Language Model's Planning Capabilities with Maximum Diversity Fine-tuning [10.704716790096498]
Large language models (LLMs) have demonstrated impressive task-solving capabilities, achieved through either prompting techniques or system designs.
This paper investigates the impact of fine-tuning on LLMs' planning capabilities.
We propose the Maximum Diversity Fine-Tuning (MDFT) strategy to improve the sample efficiency of fine-tuning in the planning domain.
arXiv Detail & Related papers (2024-06-15T03:06:14Z) - Differentiable Combinatorial Scheduling at Scale [18.09256072039255]
We propose a differentiable scheduling framework, utilizing Gumbel-Softmax differentiable sampling technique.
To handle scheduling tasks, we introduce the constrained Gumbel Trick, which encodes arbitrary inequality constraints.
Our method facilitates an efficient and scalable scheduling via gradient descent without the need for training data.
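For intuition, here is a minimal self-contained sketch (my own illustration, not the paper's method) of how Gumbel-Softmax makes a discrete scheduling choice differentiable. Note that the paper's constrained Gumbel Trick encodes inequality constraints inside the sampling itself, whereas this sketch only approximates a one-task-per-slot constraint with a soft penalty:

```python
# Differentiable toy scheduler: each task picks a start slot via a relaxed
# one-hot Gumbel-Softmax sample, so a soft makespan can be optimized by
# gradient descent. Task durations and the penalty weight are made up.
import torch
import torch.nn.functional as F

n_tasks, n_slots = 4, 6
durations = torch.tensor([2.0, 1.0, 3.0, 1.0])
logits = torch.zeros(n_tasks, n_slots, requires_grad=True)  # learnable schedule
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    assign = F.gumbel_softmax(logits, tau=0.5, hard=False)  # (tasks, slots)
    slots = torch.arange(n_slots, dtype=torch.float)
    start = (assign * slots).sum(dim=1)                     # expected start time
    finish = start + durations
    load = assign.sum(dim=0)                                # expected tasks per slot
    # Smooth max of finish times, plus a soft one-task-per-slot penalty.
    loss = finish.logsumexp(dim=0) + ((load - 1).clamp(min=0) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

print(logits.argmax(dim=1))  # hard schedule read off after training
```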
arXiv Detail & Related papers (2024-06-06T02:09:39Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Learning Logic Specifications for Policy Guidance in POMDPs: an
Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that learned specifications expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, at lower computational cost.
arXiv Detail & Related papers (2024-02-29T15:36:01Z) - Can LLMs Configure Software Tools [0.76146285961466]
In software engineering, the meticulous configuration of software tools is crucial in ensuring optimal performance within intricate systems.
In this study, we explore how Large-Language Models (LLMs) can streamline the software configuration process.
Our work presents a novel approach that employs LLMs, such as Chat-GPT, to identify starting conditions and narrow down the search space, improving configuration efficiency.
arXiv Detail & Related papers (2023-12-11T05:03:02Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
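A hedged sketch of the general batching idea behind OverPrompt follows (the prompt template and the `llm` call are my assumptions, not from the paper): pack several classification inputs into one prompt so a single LLM call answers all of them, amortizing per-call overhead.

```python
# Build one prompt that carries several task inputs at once.
def batched_prompt(task, items):
    lines = [f"{task} Answer one label per line, in order."]
    lines += [f"{i + 1}. {text}" for i, text in enumerate(items)]
    return "\n".join(lines)

prompt = batched_prompt(
    "Classify the sentiment of each review as positive or negative.",
    ["Great battery life.", "Screen died in a week.", "Does what it says."],
)
# response = llm(prompt)                  # one call instead of three
# labels = response.strip().splitlines()  # parse one label per line
```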
arXiv Detail & Related papers (2023-05-24T10:08:04Z) - Planning with Dynamically Estimated Action Costs [2.8326418377665346]
Information about action costs is critical for real-world AI planning applications.
Recent approaches use black-box external action cost estimators, often learned from data, that are applied during the planning phase.
We suggest a generalization of deterministic planning with action costs that allows selecting between multiple estimators for action cost.
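As a toy sketch of that setting (my own illustration, not the paper's formalism; the estimator costs, noise levels, and escalation rule are invented), the planner below consults a cheap, noisy cost estimator first and escalates to a more accurate, more expensive one only when two candidate actions look too close to distinguish:

```python
import random

# Hypothetical estimators: (cost to evaluate, noise std of the estimate).
ESTIMATORS = [(0.01, 5.0), (0.10, 1.0), (1.00, 0.1)]

def choose_action(true_costs, margin=2.0):
    """Pick the cheapest-looking of two or more actions, escalating to a
    better estimator until best and runner-up differ by at least `margin`."""
    spent = 0.0
    for eval_cost, noise in ESTIMATORS:
        estimates = [c + random.gauss(0.0, noise) for c in true_costs]
        spent += eval_cost * len(true_costs)
        best, second = sorted(estimates)[:2]
        if second - best >= margin:
            break  # confident enough; stop paying for estimation
    return estimates.index(min(estimates)), spent
```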
arXiv Detail & Related papers (2022-06-08T21:10:37Z) - Uncertainty-aware Remaining Useful Life predictor [57.74855412811814]
Remaining Useful Life (RUL) estimation is the problem of inferring how long a certain industrial asset can be expected to operate.
In this work, we consider Deep Gaussian Processes (DGPs) as possible solutions, since they quantify the uncertainty of their predictions.
The performance of the algorithms is evaluated on the N-CMAPSS dataset from NASA for aircraft engines.
arXiv Detail & Related papers (2021-04-08T08:50:44Z) - Integration of Convolutional Neural Networks in Mobile Applications [3.0280987248827085]
We study the performance of a system that integrates a Deep Learning model, examining the trade-off between accuracy and complexity.
We identify the most concerning challenges when deploying DL-based software in mobile applications.
arXiv Detail & Related papers (2021-03-11T15:27:05Z) - Combining Deep Learning and Optimization for Security-Constrained
Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling Automatic Primary Response (APR) within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.