Related papers: Queueing, Predictions, and LLMs: Challenges and Open Problems

URL: http://arxiv.org/abs/2503.07545v1
Date: Mon, 10 Mar 2025 17:12:47 GMT
Title: Queueing, Predictions, and LLMs: Challenges and Open Problems
Authors: Michael Mitzenmacher, Rana Shahout
Abstract summary: Queueing systems present opportunities for applying machine-learning predictions, such as estimated service times, to improve system performance. Recent studies explore queues with predicted service times, typically aiming to minimize job time in the system. We consider an important practical example of using predictions in scheduling, namely Large Language Model (LLM) systems.
Score: 9.22255012731159
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Queueing systems present many opportunities for applying machine-learning predictions, such as estimated service times, to improve system performance. This integration raises numerous open questions about how predictions can be effectively leveraged to improve scheduling decisions. Recent studies explore queues with predicted service times, typically aiming to minimize job time in the system. We review these works, highlight the effectiveness of predictions, and present open questions on queue performance. We then move to consider an important practical example of using predictions in scheduling, namely Large Language Model (LLM) systems, which presents novel scheduling challenges and highlights the potential for predictions to improve performance. In particular, we consider LLMs performing inference. Inference requests (jobs) in LLM systems are inherently complex; they have variable inference times, dynamic memory footprints that are constrained by key-value (KV) store memory limitations, and multiple possible preemption approaches that affect performance differently. We provide background on the important aspects of scheduling in LLM systems, and introduce new models and open problems that arise from them. We argue that there are significant opportunities for applying insights and analysis from queueing theory to scheduling in LLM systems.
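To make the idea of scheduling with predicted service times concrete, here is a minimal simulation sketch. It is not taken from the paper: it assumes a single non-preemptive server, Poisson arrivals, exponential job sizes with mean 1, and a multiplicative uniform noise model for predictions. The names `make_jobs` and `simulate` are hypothetical. It compares first-come-first-served (FCFS) with serving the job that has the shortest predicted size.

```python
import heapq
import random


def make_jobs(n, load, noise, seed):
    """Generate n jobs as (arrival_time, actual_size, predicted_size).

    Assumptions (illustrative only): Poisson arrivals at rate `load`,
    exponential sizes with mean 1, and predictions equal to the actual
    size times a factor drawn uniformly from [1 - noise, 1 + noise].
    """
    rng = random.Random(seed)
    t = 0.0
    jobs = []
    for _ in range(n):
        t += rng.expovariate(load)
        size = rng.expovariate(1.0)
        predicted = size * rng.uniform(1.0 - noise, 1.0 + noise)
        jobs.append((t, size, predicted))
    return jobs


def simulate(jobs, key):
    """Run a non-preemptive single-server queue over `jobs`.

    `jobs` must be sorted by arrival time. `key` maps a job to its
    priority; the waiting job with the smallest priority is served next.
    Returns the mean time in system (waiting plus service).
    """
    waiting = []   # heap of (priority, job index)
    clock = 0.0
    total = 0.0
    i = 0          # index of the next job that has not yet arrived
    n = len(jobs)
    for _ in range(n):
        # If the server is idle and nothing is queued, jump to the next arrival.
        if not waiting and jobs[i][0] > clock:
            clock = jobs[i][0]
        # Admit every job that has arrived by the current time.
        while i < n and jobs[i][0] <= clock:
            heapq.heappush(waiting, (key(jobs[i]), i))
            i += 1
        # Serve the highest-priority waiting job to completion.
        _, j = heapq.heappop(waiting)
        arrival, size, _ = jobs[j]
        clock += size
        total += clock - arrival
    return total / n


if __name__ == "__main__":
    jobs = make_jobs(n=20000, load=0.8, noise=0.5, seed=1)
    fcfs = simulate(jobs, key=lambda job: job[0])   # order by arrival time
    spjf = simulate(jobs, key=lambda job: job[2])   # order by predicted size
    print(f"FCFS mean time in system: {fcfs:.2f}")
    print(f"Shortest-predicted-job-first mean time in system: {spjf:.2f}")
```

Varying `noise` shows how prediction quality affects the gain over FCFS. The sketch ignores preemption and memory constraints, which the abstract identifies as central to LLM inference scheduling.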

Related papers

ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor [5.097511974401423]
ELIS is a serving system for Large Language Models (LLMs) featuring an Iterative Shortest Remaining Time First (ISRTF) scheduler. The ISRTF scheduler efficiently manages inference tasks with the shortest remaining time.
arXiv Detail & Related papers (2025-05-14T04:50:00Z)
Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents [6.318292471845427]
We develop the queuing fundamentals for large language model (LLM) inference. We prove that a large class of 'work-conserving' scheduling algorithms can achieve maximum throughput.
arXiv Detail & Related papers (2025-04-10T00:12:12Z)
Multi-Bin Batching for Increasing LLM Inference Throughput [19.652542432683234]
As large language models (LLMs) grow in popularity, improving the efficiency of their inference systems becomes increasingly important. Batching requests is a critical step in scheduling jobs on servers. Requests often have varying generation lengths, causing resource underutilization. We formalize this problem from a queueing-theoretic perspective, and aim to design a throughput-optimal control policy.
arXiv Detail & Related papers (2024-12-03T03:16:12Z)
Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning [104.27224674122313]
Fine-tuning MLLM has become a common practice to improve performance on specific downstream tasks. To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions.
arXiv Detail & Related papers (2024-11-17T01:16:37Z)
Don't Stop Me Now: Embedding Based Scheduling for LLMs [22.099820814682513]
Size-based scheduling algorithms like Shortest Remaining Processing Time (SRPT) aim to reduce average request completion time. We propose a prediction-based SRPT variant with limited preemption designed to account for memory overhead in LLM systems.
arXiv Detail & Related papers (2024-10-01T19:51:07Z)
A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms [34.818641985348805]
Large language models (LLMs) have achieved remarkable advancements in natural language processing. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters.
arXiv Detail & Related papers (2024-09-25T07:38:02Z)
LLMs can Schedule [3.435169201271934]
The job shop scheduling problem (JSSP) remains a significant hurdle in optimizing production processes. This paper explores the potential of Large Language Models (LLMs) for JSSP. Surprisingly, our findings demonstrate that LLM-based scheduling can achieve performance comparable to other neural approaches.
arXiv Detail & Related papers (2024-08-13T15:53:58Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs). LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient Querying [71.86163159193327]
Large language models (LLMs) have recently demonstrated their impressive ability to provide context-aware responses via text. This ability could potentially be used to predict plausible solutions in sequential decision making tasks pertaining to pattern completion. We introduce LaGR, which uses this predictive ability of LLMs to propose solutions to tasks that have been partially completed by a primary reinforcement learning (RL) agent.
arXiv Detail & Related papers (2023-08-21T02:07:35Z)
Quantifying the Cost of Learning in Queueing Systems [4.784875233446591]
Cost of Learning in Queueing (CLQ) is a new metric that quantifies the maximum increase in time-averaged queue length caused by parameter uncertainty. We propose a unified analysis framework for CLQ that bridges Lyapunov and bandit analysis, provides guarantees for a wide range of algorithms, and could be of independent interest.
arXiv Detail & Related papers (2023-08-15T14:50:12Z)
A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions. Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data. Large-scale Machine Learning aims to learn patterns from big data with comparable performance efficiently.
arXiv Detail & Related papers (2020-08-10T06:07:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.