Circinus: Efficient Query Planner for Compound ML Serving
- URL: http://arxiv.org/abs/2504.16397v1
- Date: Wed, 23 Apr 2025 03:57:24 GMT
- Title: Circinus: Efficient Query Planner for Compound ML Serving
- Authors: Banruo Liu, Wei-Yu Lin, Minghao Fang, Yihan Jiang, Fan Lai,
- Abstract summary: This paper presents Circinus, an SLO-aware query planner for large-scale compound AI workloads. By exploiting plan similarities within and across queries, Circinus significantly reduces search steps. Evaluations show that Circinus improves service goodput by 3.2-5.0$\times$ and accelerates query planning by 4.2-5.8$\times$, achieving query responses in seconds.
- Score: 3.6295638972280733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of compound AI serving -- integrating multiple operators in a pipeline that may span edge and cloud tiers -- enables end-user applications such as autonomous driving, generative AI-powered meeting companions, and immersive gaming. Achieving high service goodput -- i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and costs -- requires effective planning of operator placement, configuration, and resource allocation across infrastructure tiers. However, the diverse SLO requirements, varying edge capabilities, and high query volumes create an enormous planning search space, rendering current solutions fundamentally limited for real-time serving and cost-efficient deployments. This paper presents Circinus, an SLO-aware query planner for large-scale compound AI workloads. Circinus decomposes multi-query planning and multi-dimensional SLO objectives while preserving global decision quality. By exploiting plan similarities within and across queries, it significantly reduces search steps. It further improves per-step efficiency with a precision-aware plan profiler that incrementally profiles plans and strategically applies early stopping based on imprecise estimates of plan performance. At scale, Circinus selects query-plan combinations to maximize global SLO goodput. Evaluations in real-world settings show that Circinus improves service goodput by 3.2-5.0$\times$ and accelerates query planning by 4.2-5.8$\times$, achieving query responses in seconds, while reducing deployment costs by 3.2-4.0$\times$ over state-of-the-art systems, even in their intended single-tier deployments.
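To make the global plan-selection step concrete, here is a minimal, hypothetical sketch (not Circinus's actual algorithm): each query has candidate plans with estimated latency, accuracy, and cost; the planner keeps the cheapest SLO-satisfying plan per query and then admits queries greedily under a shared cost budget to maximize goodput. All names, numbers, and the greedy heuristic are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    latency_ms: float   # estimated end-to-end pipeline latency
    accuracy: float     # estimated pipeline accuracy
    cost: float         # estimated deployment cost

@dataclass
class Query:
    slo_latency_ms: float
    slo_accuracy: float
    plans: list         # candidate plans produced by the planner

def select_plans(queries, cost_budget):
    """Greedy stand-in for global plan selection: serve each query with its
    cheapest SLO-satisfying plan, cheapest queries first, until budget runs out."""
    chosen, goodput, spent = {}, 0, 0.0
    feasible = []
    for i, q in enumerate(queries):
        ok = [p for p in q.plans
              if p.latency_ms <= q.slo_latency_ms and p.accuracy >= q.slo_accuracy]
        if ok:
            feasible.append((i, min(ok, key=lambda p: p.cost)))
    for i, plan in sorted(feasible, key=lambda t: t[1].cost):
        if spent + plan.cost <= cost_budget:
            chosen[i], spent, goodput = plan, spent + plan.cost, goodput + 1
    return chosen, goodput

queries = [Query(200, 0.9, [Plan(150, 0.92, 1.0), Plan(80, 0.88, 2.5)]),
           Query(100, 0.8, [Plan(90, 0.85, 1.5)])]
print(select_plans(queries, cost_budget=2.0))
```

A production planner would additionally reason about operator placement across edge and cloud tiers and about incremental profiling, both of which this toy omits.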
Related papers
- Smart Routing: Cost-Effective Multi-LLM Serving for Multi-Core AIOS [31.60019342381251]
Existing scheduling frameworks mainly target latency optimization. This paper proposes ECCOS, an efficient capability-cost coordinated scheduling framework for multi-LLM serving.
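As a rough illustration of capability-cost coordinated routing (the models, costs, and thresholds below are invented, and this is not the ECCOS algorithm): send each query to the cheapest model whose estimated capability clears the query's requirement.

```python
# Hypothetical model pool with relative serving costs and capability scores.
MODELS = [
    {"name": "small-llm",  "cost_per_1k_tok": 0.1, "capability": 0.60},
    {"name": "medium-llm", "cost_per_1k_tok": 0.5, "capability": 0.80},
    {"name": "large-llm",  "cost_per_1k_tok": 2.0, "capability": 0.95},
]

def route(required_capability):
    candidates = [m for m in MODELS if m["capability"] >= required_capability]
    if not candidates:
        # Nothing clears the bar: fall back to the most capable model.
        return max(MODELS, key=lambda m: m["capability"])
    return min(candidates, key=lambda m: m["cost_per_1k_tok"])

print(route(0.75)["name"])  # -> medium-llm
```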
arXiv Detail & Related papers (2025-02-27T22:35:31Z) - CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [56.98081258047281]
Collaborative Inference with Token-lEvel Routing (CITER) is a framework that enables efficient collaboration between small and large language models. We formulate router training as a policy optimization problem, where the router receives rewards based on both the quality of predictions and the inference cost of generation. Our experiments show that CITER reduces inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
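A hedged sketch of the kind of reward signal described above, trading prediction quality against inference cost; the weights and relative costs are placeholders, not CITER's actual values.

```python
# Relative per-token inference cost of each model (invented numbers).
COST = {"small": 1.0, "large": 8.0}

def routing_reward(prediction_correct, model, cost_weight=0.05):
    """Reward = prediction quality minus a cost penalty for the chosen model."""
    quality = 1.0 if prediction_correct else 0.0
    return quality - cost_weight * COST[model]

# Routing a token to the small model is rewarded when it gets the token right...
print(routing_reward(True, "small"))   # 0.95
# ...while escalating to the large model only pays off if it fixes the prediction.
print(routing_reward(True, "large"))   # 0.60
print(routing_reward(False, "small"))  # -0.05
```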
arXiv Detail & Related papers (2025-02-04T03:36:44Z) - Automating High Quality RT Planning at Scale [4.660056689223253]
We introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution for generating high-quality treatment plans.
Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement.
A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually.
arXiv Detail & Related papers (2025-01-21T00:44:18Z) - Distilling Multi-modal Large Language Models for Autonomous Driving [64.63127269187814]
Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. We propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. Training with DiMA results in a 37% reduction in L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios.
arXiv Detail & Related papers (2025-01-16T18:59:53Z) - SCoTT: Wireless-Aware Path Planning with Vision Language Models and Strategic Chains-of-Thought [78.53885607559958]
A novel approach using vision language models (VLMs) is proposed for enabling path planning in complex wireless-aware environments. To this end, insights from a digital twin with real-world wireless ray-tracing data are explored. Results show that SCoTT achieves average path gains very close to those of DP-WA* while consistently yielding shorter path lengths.
arXiv Detail & Related papers (2024-11-27T10:45:49Z) - CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning [43.13654681136326]
We propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework for cost-aware tool planning.
CATP-LLM incorporates a tool planning language that enables the LLM to generate non-sequential plans with multiple branches for efficient concurrent tool execution and cost reduction.
Experiments on OpenCATP show that CATP-LLM outperforms GPT-4 even when using Llama2-7B as its backbone.
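For intuition about non-sequential, multi-branch tool plans, the toy below represents a plan as a dependency graph and reports which tools could run concurrently; the tools, costs, and structure are hypothetical and unrelated to CATP-LLM's actual plan language.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each tool maps to the set of tools it depends on; independent tools form branches.
plan = {
    "detect_objects": set(),
    "transcribe_audio": set(),                     # independent branch
    "caption_scene": {"detect_objects"},
    "fuse_report": {"caption_scene", "transcribe_audio"},
}
tool_cost = {"detect_objects": 2, "transcribe_audio": 3, "caption_scene": 1, "fuse_report": 1}

ts = TopologicalSorter(plan)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())        # tools whose dependencies are all satisfied
    print("run concurrently:", ready)   # these branches can execute side by side
    ts.done(*ready)
print("total plan cost:", sum(tool_cost.values()))
```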
arXiv Detail & Related papers (2024-11-25T12:05:49Z) - Tree-Planner: Efficient Close-loop Task Planning with Large Language Models [63.06270302774049]
Tree-Planner reframes task planning with Large Language Models into three distinct phases.
Tree-Planner achieves state-of-the-art performance while maintaining high efficiency.
arXiv Detail & Related papers (2023-10-12T17:59:50Z) - AdaPlanner: Adaptive Planning from Feedback with Language Models [56.367020818139665]
Large language models (LLMs) have recently demonstrated their potential to act as autonomous agents for sequential decision-making tasks.
We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback.
To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities.
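The closed-loop idea can be sketched with stubbed-out components: a plan generator that consumes feedback and an environment that reports failures. `generate_plan` and `execute` below are stand-ins for an LLM call and a real environment, not AdaPlanner's implementation.

```python
def generate_plan(task, feedback):
    """Stand-in for an LLM planner; appends a repair step when feedback is present."""
    base = [f"step 1 for {task}", f"step 2 for {task}"]
    return base + [f"fix: {feedback}"] if feedback else base

def execute(plan):
    """Stand-in environment: pretend execution fails until the plan contains a fix step."""
    ok = any(step.startswith("fix:") for step in plan)
    return ok, "" if ok else "step 2 failed: object not reachable"

def closed_loop(task, max_rounds=3):
    """Refine the self-generated plan adaptively in response to environment feedback."""
    feedback = None
    for _ in range(max_rounds):
        plan = generate_plan(task, feedback)
        success, feedback = execute(plan)
        if success:
            return plan
    return plan

print(closed_loop("stack the blocks"))
```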
arXiv Detail & Related papers (2023-05-26T05:52:27Z) - Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems [0.0]
InfAdapter proactively selects a set of ML model variants and their resource allocations to meet the latency SLO.
It decreases SLO violations and costs by up to 65% and 33%, respectively, compared to a popular industry autoscaler.
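A minimal sketch of SLO-aware variant selection in this spirit (the variants, latencies, and accuracies are made up, and this is not InfAdapter's algorithm): among variants predicted to meet the latency SLO, pick the cheapest one with acceptable accuracy.

```python
# Hypothetical model variants: (name, accuracy, cpu_cores, p99_latency_ms at forecast load).
VARIANTS = [
    ("resnet18",  0.70, 2, 35),
    ("resnet50",  0.76, 4, 70),
    ("resnet152", 0.78, 8, 140),
]

def pick_variant(latency_slo_ms, min_accuracy):
    """Return the variant with the fewest cores (a proxy for cost) that meets both targets."""
    ok = [v for v in VARIANTS if v[3] <= latency_slo_ms and v[1] >= min_accuracy]
    return min(ok, key=lambda v: v[2]) if ok else None

print(pick_variant(latency_slo_ms=100, min_accuracy=0.75))  # -> ('resnet50', 0.76, 4, 70)
```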
arXiv Detail & Related papers (2023-04-21T11:19:49Z) - A Framework for Neurosymbolic Robot Action Planning using Large Language Models [3.0501524254444767]
We present a framework aimed at bridging the gap between symbolic task planning and machine learning approaches.
The rationale is to train Large Language Models (LLMs) into a neurosymbolic task planner compatible with the Planning Domain Definition Language (PDDL).
Preliminary results in selected domains show that our method can: (i) solve 95.5% of problems in a test data set of 1,000 samples; (ii) produce plans up to 13.5% shorter than those of a traditional symbolic planner; (iii) reduce the average waiting time for plan availability by up to 61.4%.
arXiv Detail & Related papers (2023-03-01T11:54:22Z) - Innovations in the field of on-board scheduling technologies [64.41511459132334]
This paper proposes an onboard scheduler that integrates into an onboard software framework for mission autonomy.
The scheduler is based on linear integer programming and relies on a branch-and-cut solver.
The technology has been tested on an Earth Observation scenario, comparing its performance against the state-of-the-art scheduling technology.
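To illustrate the flavor of such an onboard scheduling problem, the toy below selects observation tasks under an energy budget; it uses brute-force enumeration as a stand-in for the linear-integer-programming, branch-and-cut approach described above, and all task data are invented.

```python
from itertools import combinations

# Hypothetical onboard tasks: (name, reward, energy). Values are invented.
TASKS = [("img_siteA", 5, 3), ("img_siteB", 4, 2), ("downlink", 6, 4), ("calibrate", 2, 1)]
ENERGY_BUDGET = 6

def best_schedule(tasks, budget):
    """Brute-force 0/1 selection maximizing reward under an energy budget.
    A real onboard scheduler would solve this as an integer program with branch-and-cut."""
    best, best_reward = (), 0
    for r in range(len(tasks) + 1):
        for subset in combinations(tasks, r):
            energy = sum(t[2] for t in subset)
            reward = sum(t[1] for t in subset)
            if energy <= budget and reward > best_reward:
                best, best_reward = subset, reward
    return [t[0] for t in best], best_reward

print(best_schedule(TASKS, ENERGY_BUDGET))  # -> (['img_siteA', 'img_siteB', 'calibrate'], 11)
```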
arXiv Detail & Related papers (2022-05-04T12:00:49Z)