Ensuring Fair LLM Serving Amid Diverse Applications
- URL: http://arxiv.org/abs/2411.15997v1
- Date: Sun, 24 Nov 2024 22:35:44 GMT
- Title: Ensuring Fair LLM Serving Amid Diverse Applications
- Authors: Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, Anoop Kulkarni, Steve Kofsky, Pankhuri Choudhary, Renèe St. Amant, Rujia Wang, Yue Cheng, Ali R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan
- Abstract summary: This paper analyzes millions of requests from thousands of users on MS CoPilot, a real-world multi-tenant LLM platform hosted by Microsoft.
Our analysis confirms the inadequacy of existing methods and guides the development of FairServe, a system that ensures fair LLM access across diverse applications.
- Score: 13.346272116841288
- License:
- Abstract: In a multi-tenant large language model (LLM) serving platform hosting diverse applications, some users may submit an excessive number of requests, causing the service to become unavailable to other users and creating unfairness. Existing fairness approaches do not account for variations in token lengths across applications and multiple LLM calls, making them unsuitable for such platforms. To address the fairness challenge, this paper analyzes millions of requests from thousands of users on MS CoPilot, a real-world multi-tenant LLM platform hosted by Microsoft. Our analysis confirms the inadequacy of existing methods and guides the development of FairServe, a system that ensures fair LLM access across diverse applications. FairServe couples application-characteristic-aware request throttling with a weighted-service-counter-based scheduling technique to curb abusive behavior and ensure fairness. Our experimental results on real-world traces demonstrate FairServe's superior performance compared to the state-of-the-art method in ensuring fairness. We are actively working on deploying the system in production, expecting to benefit millions of customers worldwide.
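The abstract names the two mechanisms but not their details, so the sketch below is only a plausible illustration of weighted-service-counter scheduling with per-application throttling; the class name, application weights, and threshold are all assumptions, not FairServe's actual interface.

```python
from collections import defaultdict

class WeightedServiceScheduler:
    """Sketch of weighted-service-counter scheduling with throttling.

    Each user accrues a service counter weighted by the characteristics
    of the application they call through; the backlogged user with the
    lowest counter is served next. Weights and the throttle threshold
    are illustrative assumptions, not FairServe's actual parameters.
    """

    def __init__(self, app_weights, throttle_threshold=10_000.0):
        self.app_weights = app_weights        # per-application cost weight (assumed)
        self.counters = defaultdict(float)    # accumulated weighted service per user
        self.queues = defaultdict(list)       # pending requests per user
        self.throttle_threshold = throttle_threshold

    def submit(self, user, app, prompt_tokens):
        # Application-characteristic-aware throttling: reject requests from
        # users whose weighted service already exceeds the threshold.
        if self.counters[user] >= self.throttle_threshold:
            return False
        self.queues[user].append((app, prompt_tokens))
        return True

    def next_request(self):
        # Serve the backlogged user with the smallest weighted counter.
        backlogged = [u for u, q in self.queues.items() if q]
        if not backlogged:
            return None
        user = min(backlogged, key=lambda u: self.counters[u])
        app, prompt_tokens = self.queues[user].pop(0)
        return user, app, prompt_tokens

    def record_completion(self, user, app, prompt_tokens, output_tokens):
        # Charge input and output tokens, scaled by the application's weight,
        # so token-hungry applications cost proportionally more service.
        weight = self.app_weights.get(app, 1.0)
        self.counters[user] += weight * (prompt_tokens + output_tokens)
```

Scaling each user's counter by an application weight makes requests from token-hungry applications cost more "service," which is the intuition the abstract describes.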
Related papers
- Revisiting SLO and Goodput Metrics in LLM Serving [17.777554083636716]
Service level objectives (SLOs) and goodput (the number of requests that meet SLOs per second) are introduced to evaluate the performance of LLM serving.
Existing metrics fail to capture the nature of user experience.
We propose a unified metric framework, smooth goodput, that incorporates SLOs and goodput to reflect the nature of user experience.
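As a concrete reading of that definition (an illustration, not the paper's framework), goodput over a window can be computed from request logs; the log schema here is assumed:

```python
def goodput(requests, window_seconds):
    """Requests that met their SLO, per second, over a time window.

    Assumes each request is a dict with 'latency' and 'slo' fields;
    this schema is illustrative, not the paper's.
    """
    met = sum(1 for r in requests if r["latency"] <= r["slo"])
    return met / window_seconds
```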
arXiv Detail & Related papers (2024-10-18T08:05:37Z)
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite their excellence in many domains, LLM judges have potential issues that remain under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
- FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model [48.33280660752336]
Large language models (LLMs) show strong performance on many domain-specific tasks after fine-tuning with appropriate data.
However, much domain-specific data is privately distributed across multiple owners.
We introduce FedBiOT, a resource-efficient approach to LLM fine-tuning in federated learning.
arXiv Detail & Related papers (2024-06-25T16:45:47Z)
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capabilities on various tasks, and integrating these capabilities into Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal large language models (MLLMs) have shown promise in web-related tasks.
However, evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.
VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z)
- CHOPS: CHat with custOmer Profile Systems for Customer Service with LLMs [7.888131064071474]
Current customer service models have limited integration with customer profiles.
Existing API integrations emphasize diversity over the precision and error avoidance essential in real-world customer service scenarios.
arXiv Detail & Related papers (2024-03-31T07:11:48Z)
- RouterBench: A Benchmark for Multi-LLM Routing System [25.515453832224804]
No single model can optimally address all tasks and applications, particularly when balancing performance with cost.
This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs.
We present RouterBench, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems.
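RouterBench evaluates routers rather than prescribing one, but a minimal cost-aware router makes the idea concrete; the model cost table and quality estimator below are hypothetical, not RouterBench's interface:

```python
def route(request, model_costs, quality_estimator, quality_bar=0.8):
    """Pick the cheapest model predicted to clear a quality bar.

    `model_costs` maps model name to cost per request; `quality_estimator`
    returns a predicted score in [0, 1] for (model, request). Both are
    hypothetical assumptions for illustration.
    """
    viable = [(cost, name) for name, cost in model_costs.items()
              if quality_estimator(name, request) >= quality_bar]
    if not viable:
        # No model clears the bar: fall back to the most capable
        # (here, simply the most expensive) model.
        return max(model_costs, key=model_costs.get)
    return min(viable)[1]
```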
arXiv Detail & Related papers (2024-03-18T17:59:04Z)
- Fairness in Serving Large Language Models [45.81800239353461]
This paper introduces the definition of serving fairness based on a cost function that accounts for the number of input and output tokens processed.
We propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on this cost function.
We prove a 2x tight upper bound on the service difference between two backlogged clients, adhering to the requirement of work-conserving.
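The VTC paper targets continuous batching; the sketch below strips that away and shows only the core bookkeeping (serve the backlogged client with the smallest counter, then charge it for tokens). The input/output weights are illustrative, not the paper's constants.

```python
from collections import defaultdict, deque

class VirtualTokenCounter:
    """Sketch of VTC-style fair scheduling (simplified, no batching).

    Serves the backlogged client with the smallest virtual counter,
    then charges that counter with the request's token cost.
    """

    def __init__(self, input_weight=1.0, output_weight=2.0):
        self.input_weight = input_weight      # assumed weight per input token
        self.output_weight = output_weight    # assumed weight per output token
        self.counters = defaultdict(float)
        self.queues = defaultdict(deque)

    def enqueue(self, client, request):
        self.queues[client].append(request)

    def dispatch(self):
        # Work-conserving: serve whenever any client is backlogged.
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()

    def charge(self, client, input_tokens, output_tokens):
        # The cost function counts both input and output tokens.
        self.counters[client] += (self.input_weight * input_tokens
                                  + self.output_weight * output_tokens)
```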
arXiv Detail & Related papers (2023-12-31T21:15:54Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are largely limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- LiFT: A Scalable Framework for Measuring Fairness in ML Applications [18.54302159142362]
We present the LinkedIn Fairness Toolkit (LiFT), a framework for scalable computation of fairness metrics as part of large ML systems.
We discuss the challenges encountered in incorporating fairness tools in practice and the lessons learned during deployment at LinkedIn.
arXiv Detail & Related papers (2020-08-14T03:55:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.