Fairness in Serving Large Language Models
- URL: http://arxiv.org/abs/2401.00588v2
- Date: Wed, 5 Jun 2024 06:43:16 GMT
- Title: Fairness in Serving Large Language Models
- Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica
- Abstract summary: This paper introduces the definition of serving fairness based on a cost function that accounts for the number of input and output tokens processed.
We propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism.
We prove a 2x tight upper bound on the service difference between two backlogged clients, adhering to the requirement of work-conserving.
- Score: 45.81800239353461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services have request rate limits, to ensure that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of the resources and poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and their unique batching characteristics on parallel accelerators. This paper introduces the definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a 2x tight upper bound on the service difference between two backlogged clients, adhering to the requirement of work-conserving. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods, which exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact
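As a rough illustration of the mechanism described in the abstract, below is a minimal sketch of a VTC-style fair scheduler: each client accrues a virtual counter charged by the weighted number of input and output tokens it has been served, and the dispatcher always serves the backlogged client with the smallest counter. The cost weights and method names are illustrative assumptions, not the authors' implementation (see the linked repository for that).
```python
from collections import deque

class VirtualTokenCounter:
    """Sketch of a VTC-style fair scheduler (illustrative, not the
    authors' code). Each client's counter tracks the weighted tokens
    it has been served; the least-served backlogged client goes next."""

    def __init__(self, w_in=1.0, w_out=2.0):
        self.w_in, self.w_out = w_in, w_out  # assumed per-token cost weights
        self.counter = {}                    # client -> service received
        self.queue = {}                      # client -> pending requests

    def submit(self, client, request):
        q = self.queue.setdefault(client, deque())
        if not q:
            # Lift a newly (re)activated client's counter to the minimum
            # over currently backlogged clients, so idling cannot bank
            # credit and later starve others.
            backlogged = [self.counter[c] for c, qq in self.queue.items()
                          if qq and c != client]
            lift = min(backlogged, default=self.counter.get(client, 0.0))
            self.counter[client] = max(self.counter.get(client, 0.0), lift)
        q.append(request)

    def next_request(self):
        # Work-conserving: dispatch whenever any queue is nonempty.
        backlogged = [c for c, q in self.queue.items() if q]
        if not backlogged:
            return None
        c = min(backlogged, key=self.counter.get)
        return c, self.queue[c].popleft()

    def charge(self, client, n_in, n_out):
        # Called as tokens are actually processed, e.g., once per
        # continuous-batching iteration.
        self.counter[client] += self.w_in * n_in + self.w_out * n_out
```
A serving loop would interleave next_request with charge as the continuous batch admits requests and emits tokens, which is what keeps the counters in step with the service each client actually receives.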
Related papers
- Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to the substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
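As a loose illustration of what a high token compression setting might look like, here is one very simple baseline that pools N visual tokens down to k; the function and pooling scheme are assumptions for illustration, and the paper's actual compression approach may differ.
```python
import torch

def compress_visual_tokens(visual_tokens: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Reduce (N, d) visual tokens to (k, d) by mean-pooling contiguous
    groups -- a naive compression baseline, not the paper's method."""
    groups = visual_tokens.tensor_split(k, dim=0)
    return torch.stack([g.mean(dim=0) for g in groups])

# e.g., 576 patch embeddings of width 1024 -> a single visual token
one_token = compress_visual_tokens(torch.randn(576, 1024), k=1)
```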
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI).
In this paper, we propose a new efficient LLM inference serving framework, named ALISE.
We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z)
- SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization [8.121663525764294]
Large language models (LLMs) play a crucial role in our daily lives due to their ability to understand and generate human-like text.
In this report, we design a collaborative inference architecture between a server and its clients to alleviate the throughput limit.
We show in experiments that the workload can be distributed efficiently, yielding a roughly 1/3 reduction in the server's workload.
arXiv Detail & Related papers (2024-10-14T17:38:41Z)
- Efficient LLM Scheduling by Learning to Rank [19.33941579312897]
We show that it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank.
We develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches.
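A hedged sketch of the idea: if a learned model can score requests so that the ordering of scores approximates the ordering of true output lengths, then sorting the queue by that score approximates SJF without knowing generation lengths in advance. The predictor here is a hypothetical stand-in, not the paper's model.
```python
def sjf_approx_order(requests, predict_rank_score):
    """Order pending requests by a learned proxy for output length.

    `predict_rank_score` is a hypothetical learned scorer: it maps a
    prompt to a scalar whose ordering approximates the ordering of the
    (unknown) true output lengths. Sorting by it yields an approximate
    shortest-job-first schedule.
    """
    return sorted(requests, key=lambda r: predict_rank_score(r.prompt))
```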
arXiv Detail & Related papers (2024-08-28T13:35:54Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
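One plausible reading of the multi-stage mechanism is sketched below, with a generic `student_generate` call standing in for the student LLM; the prompt templates and quality filter are illustrative assumptions, not the paper's exact pipeline.
```python
def self_guide_round(student_generate, task_instruction, seed_examples, n=64):
    """One illustrative round of self-synthetic data generation
    (assumed pipeline, not the paper's exact recipe)."""
    pairs = []
    for _ in range(n):
        # Ask the student model to invent a fresh task input...
        x = student_generate(
            f"Instruction: {task_instruction}\n"
            f"Examples: {seed_examples}\n"
            "Write one new input for this task:")
        # ...then to answer its own input.
        y = student_generate(
            f"Instruction: {task_instruction}\nInput: {x}\nOutput:")
        if x.strip() and y.strip():  # crude quality filter (assumption)
            pairs.append((x, y))
    return pairs  # synthetic pairs used to finetune the same student LLM
```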
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long-context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
- A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded as embedding vectors, resulting in excessively large embedding tables.
Despite the prosperity of lightweight embedding-based RSs, their evaluation protocols vary widely.
This study investigates various LERS' performance, efficiency, and cross-task transferability via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Scaling Laws for Discriminative Classification in Large Language Models [5.56747083508457]
We present a system that allows us to use an LLM to augment our customer support advocates by re-framing the language modeling task as a discriminative classification task.
We present the results of both offline and online experiments, where we observed offline gains and statistically significant online lifts for our experimental system.
We close by discussing the space of trade-offs with respect to model size, latency, and accuracy, and by suggesting future applications to explore; a hedged sketch of the discriminative re-framing appears after this entry.
arXiv Detail & Related papers (2024-05-24T17:58:38Z)
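A common way to realize the re-framing described in the entry above is to restrict scoring to the label tokens rather than sampling free-form text. The sketch below uses a placeholder model and prompt, and is not necessarily the paper's exact setup.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def classify(text: str, labels=("yes", "no")) -> str:
    """Score each label by the next-token logit of its first token,
    turning the LM into a discriminative classifier (illustrative)."""
    prompt = f"Customer message: {text}\nNeeds escalation:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]             # next-token logits
    label_ids = [tok.encode(" " + l)[0] for l in labels]
    scores = {l: logits[i].item() for l, i in zip(labels, label_ids)}
    return max(scores, key=scores.get)
```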
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.