Serve Programs, Not Prompts
- URL: http://arxiv.org/abs/2510.25412v1
- Date: Wed, 29 Oct 2025 11:29:03 GMT
- Title: Serve Programs, Not Prompts
- Authors: In Gim, Lin Zhong
- Abstract summary: We propose a new large language model (LLM) serving system architecture that serves programs instead of prompts, addressing the inflexibility of current serving systems. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LLM Inference Programs (LIPs).
- Score: 1.285540133357144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
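To make the LIP idea concrete, here is a minimal Python sketch of what an inference program might look like under such an architecture. The syscall-style interface (kv_open, forward, sample) and the KV-cache path are hypothetical stand-ins, not Symphony's actual API.

```python
# Hypothetical sketch of an LLM Inference Program (LIP). The syscall-style
# API below is invented for illustration; the paper's interface may differ.
import random

class SymphonyStub:
    """Stand-in for the Symphony kernel: exposes model computation as
    syscalls and the KV cache as file-like objects."""
    def kv_open(self, path: str):
        # Open (or create) a KV-cache "file" in the dedicated file system.
        return {"path": path, "tokens": []}

    def forward(self, kv, token: int):
        # One model forward step; returns fake logits here.
        kv["tokens"].append(token)
        return [random.random() for _ in range(16)]

    def sample(self, logits):
        # User-customizable token prediction: greedy here.
        return max(range(len(logits)), key=logits.__getitem__)

def lip_main(sys_api: SymphonyStub, prompt_tokens):
    """The LIP owns its decoding loop and can reuse KV cache across calls."""
    kv = sys_api.kv_open("/kv/session-42")   # virtualized KV cache
    out = []
    for t in prompt_tokens:                  # prefill
        logits = sys_api.forward(kv, t)
    for _ in range(8):                       # decode with custom sampling
        nxt = sys_api.sample(logits)
        out.append(nxt)
        logits = sys_api.forward(kv, nxt)
    return out

print(lip_main(SymphonyStub(), [1, 2, 3]))
```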
Related papers
- Pie: A Programmable Serving System for Emerging LLM Applications [3.905272047350447]
Pie is a programmable serving system designed for flexibility and efficiency. It decomposes the traditional generation loop into fine-grained service handlers exposed via an API. It executes inferlets (user-supplied inference programs) using WebAssembly, benefiting from its lightweight sandboxing.
arXiv Detail & Related papers (2025-10-28T04:17:55Z)
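A rough sketch of the decomposed generation loop described above, assuming hypothetical handler names (prefill, decode_step); the real Pie API and its WebAssembly sandboxing are not modeled here.

```python
# Hypothetical Pie-style fine-grained service handlers; names are assumptions.
from typing import Callable, Dict, List

def make_handlers() -> Dict[str, Callable]:
    state: Dict[str, List[int]] = {}
    def prefill(session: str, tokens: List[int]) -> None:
        state[session] = list(tokens)      # build per-session KV state (simulated)
    def decode_step(session: str) -> int:
        tok = sum(state[session]) % 100    # fake next-token rule for the sketch
        state[session].append(tok)
        return tok
    return {"prefill": prefill, "decode_step": decode_step}

# An "inferlet" drives the loop itself instead of sending one opaque prompt.
handlers = make_handlers()
handlers["prefill"]("s1", [5, 9, 11])
generated = [handlers["decode_step"]("s1") for _ in range(4)]
print(generated)
```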
- Justitia: Fair and Efficient Scheduling for LLM Applications [32.900257208449716]
We design Justitia, a novel scheduler with three key techniques. Justitia models the service cost of LLM applications in a memory-centric manner. It uses a simple neural network model to conduct lightweight and accurate demand prediction.
arXiv Detail & Related papers (2025-10-19T21:34:34Z)
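As a hedged illustration of memory-centric cost modeling, the toy cost formula and linear output-length predictor below are assumptions, not Justitia's actual model.

```python
# Illustrative only: estimate a request's service cost by the KV-cache
# memory it will occupy and for how long (a memory-time proxy).
def service_cost(kv_bytes_per_token: int, context_len: int,
                 predicted_output_len: int) -> float:
    peak_mem = kv_bytes_per_token * (context_len + predicted_output_len)
    return peak_mem * predicted_output_len

def predict_output_len(prompt_len: int) -> int:
    # Stand-in for the paper's lightweight neural predictor: a linear rule.
    return max(8, int(0.5 * prompt_len))

jobs = [("chat", 200), ("summarize", 2000), ("agent", 600)]
ranked = sorted(jobs, key=lambda j: service_cost(2048, j[1],
                                                 predict_output_len(j[1])))
print([name for name, _ in ranked])  # cheaper jobs scheduled first
```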
- Autellix: An Efficient Serving Engine for LLM Agents as General Programs [59.673243129044465]
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs. Existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. We introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies.
arXiv Detail & Related papers (2025-02-19T18:59:30Z)
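One way to read "programs as first-class citizens" is program-level rather than call-level scheduling; the least-attained-service heuristic below is an illustrative assumption, not necessarily Autellix's algorithm.

```python
# Sketch: prioritize each LLM call by its parent program's accumulated
# service time instead of treating every call independently.
import heapq, itertools

counter = itertools.count()
queue = []  # (program_elapsed_time, tiebreak, program_id, call)

def submit(program_elapsed: float, program_id: str, call: str) -> None:
    heapq.heappush(queue, (program_elapsed, next(counter), program_id, call))

# Calls from a long-running program wait behind a fresh program's calls.
submit(12.0, "agentA", "step-7")
submit(0.0, "agentB", "step-1")
submit(3.5, "agentC", "step-2")

while queue:
    elapsed, _, pid, call = heapq.heappop(queue)
    print(f"run {pid}:{call} (elapsed={elapsed}s)")
```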
- Optimizing Token Usage on Large Language Model Conversations Using the Design Structure Matrix [49.1574468325115]
Large language models are becoming ubiquitous across many sectors and tasks.
There is a need to reduce token usage to overcome challenges such as short context windows, limited output sizes, and the costs associated with token intake and generation.
This work brings the Design Structure Matrix from the engineering design discipline into LLM conversation optimization.
arXiv Detail & Related papers (2024-10-01T14:38:36Z)
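A toy example of applying a Design Structure Matrix to conversation structure: mutually dependent topics are grouped so their context is sent together rather than repeated across turns. The matrix and the grouping rule are invented for illustration, not taken from the paper.

```python
# Toy DSM over conversation topics; dsm[i][j] = 1 means topic i depends on j.
topics = ["specs", "materials", "cost", "schedule"]
dsm = [
    [0, 1, 0, 0],   # specs depends on materials
    [1, 0, 0, 0],   # materials depends on specs
    [0, 1, 0, 1],   # cost depends on materials and schedule
    [0, 0, 1, 0],   # schedule depends on cost
]

def coupled_pairs(m):
    # Mutually dependent topics should share one prompt/context window.
    n = len(m)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if m[i][j] and m[j][i]]

for i, j in coupled_pairs(dsm):
    print(f"send '{topics[i]}' and '{topics[j]}' in the same conversation turn")
```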
- From Commands to Prompts: LLM-based Semantic File System for AIOS [46.29019415676847]
We propose an LLM-based semantic file system (LSFS) for prompt-driven file management. Unlike conventional approaches, LSFS incorporates LLMs to enable users or agents to interact with files through natural language prompts. Our experiments show that LSFS offers significant improvements over traditional file systems in terms of user convenience, the diversity of supported functions, and the accuracy and efficiency of file operations.
arXiv Detail & Related papers (2024-09-23T08:39:16Z)
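A minimal sketch of prompt-driven file access in the spirit of LSFS; the bag-of-words matcher below stands in for the LLM-backed semantic index, which the paper implements differently.

```python
# Map a natural-language request to a file, without knowing the exact path.
files = {
    "report_q3.txt": "quarterly revenue and growth figures",
    "notes_meeting.txt": "action items from the design meeting",
    "model_card.md": "evaluation results for the new model",
}

def semantic_search(query: str) -> str:
    q = set(query.lower().split())
    def score(desc: str) -> int:
        return len(q & set(desc.lower().split()))  # toy word-overlap match
    return max(files, key=lambda f: score(files[f]))

# "Retrieve the file about revenue" instead of remembering the path.
print(semantic_search("file about revenue and growth"))
```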
- Teola: Towards End-to-End Optimization of LLM-based Applications [13.478509565946354]
In large language model (LLM)-based applications, both LLM and non-LLM components contribute to end-to-end latency. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph.
arXiv Detail & Related papers (2024-06-29T05:59:53Z)
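The primitive-level dataflow idea can be sketched with a topological sort over task primitives; the primitive names below are assumptions, and Python's standard graphlib stands in for Teola's scheduler.

```python
# Represent one query's workflow as a dataflow graph of primitives, so the
# scheduler can overlap independent primitives instead of running coarse
# modules back-to-back.
from graphlib import TopologicalSorter

graph = {                       # node -> set of predecessors
    "embed":    set(),
    "retrieve": {"embed"},
    "prefill":  {"retrieve"},
    "decode":   {"prefill"},
    "rerank":   {"retrieve"},   # independent of prefill: can overlap
}

ts = TopologicalSorter(graph)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())
    print("run in parallel:", ready)  # everything in `ready` can overlap
    ts.done(*ready)
```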
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capabilities on various tasks, and integrating them into Internet of Things (IoT) applications has recently drawn much research attention.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
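A hedged sketch of the local-deployment pattern: calling an open-source LLM served inside the local network over HTTP. The endpoint URL and JSON schema are assumptions about a generic local server, not the paper's system.

```python
# Query a locally hosted LLM so no data leaves the local network.
import json, urllib.request

def local_llm(prompt: str, url: str = "http://127.0.0.1:8000/generate") -> str:
    # The /generate path and request schema are assumed for this sketch.
    body = json.dumps({"prompt": prompt, "max_tokens": 64}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:   # stays on the LAN
        return json.loads(resp.read())["text"]

# Example IoT task (uncomment when a local server is running):
# print(local_llm("Summarize: temp=81C on pump-3, threshold=75C"))
```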
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving [8.706905652975554]
This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes prompt sharing.
We designed a distributed scheduling system that co-optimizes KV state reuse and computation load-balancing with a new scheduling algorithm and a hierarchical scheduling mechanism.
Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5X to 14.5X on average latency and 2X to 10X on p99 latency.
arXiv Detail & Related papers (2024-05-08T06:30:58Z)
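To illustrate co-optimizing KV reuse and load balancing, the routing score below trades cached-prefix length against server load; the scoring rule and its weight are invented for illustration, not Preble's algorithm.

```python
# Route a request to the server that balances KV-prefix reuse against load.
def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

servers = [
    {"id": 0, "cached_prefix": "You are a helpful assistant. Task:", "load": 0.9},
    {"id": 1, "cached_prefix": "You are a helpful", "load": 0.2},
]

def route(prompt: str) -> int:
    def score(s):
        reuse = shared_prefix_len(prompt, s["cached_prefix"])
        return reuse - 50 * s["load"]   # weight is arbitrary for this sketch
    return max(servers, key=score)["id"]

print(route("You are a helpful assistant. Task: classify this ticket."))
```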
- An LLM Compiler for Parallel Function Calling [68.04566807806071]
We introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function calls.
We observe consistent latency speedup of up to 3.7x, cost savings of up to 6.7x, and accuracy improvement of up to 9% compared to ReAct.
arXiv Detail & Related papers (2023-12-07T18:32:04Z)
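The parallel function-calling idea can be sketched with a thread pool over a pre-planned set of independent tool calls; the planner output and the tools below are stand-ins, not LLMCompiler's implementation.

```python
# Run independent tool calls concurrently instead of the sequential
# think-act loop of ReAct.
from concurrent.futures import ThreadPoolExecutor

def search_weather(city: str) -> str:   # stand-in tool
    return f"{city}: 18C"

def search_flights(city: str) -> str:   # stand-in tool
    return f"{city}: 3 flights found"

plan = [("search_weather", "Paris"), ("search_flights", "Paris")]
tools = {"search_weather": search_weather, "search_flights": search_flights}

with ThreadPoolExecutor() as pool:      # both calls have no dependency
    futures = [pool.submit(tools[name], arg) for name, arg in plan]
    results = [f.result() for f in futures]
print(results)  # a downstream LLM step would consume both results at once
```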
- L2MAC: Large Language Model Automatic Computer for Extensive Code Generation [52.81694565226513]
Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture. This paper presents L2MAC, the first practical LLM-based general-purpose stored-program automatic computer (von Neumann architecture) framework, for long and consistent output generation.
arXiv Detail & Related papers (2023-10-02T16:55:19Z)
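A minimal sketch of the stored-program pattern: a persistent instruction list executed step by step against an external memory that outlives any single context window. The step function here merely records execution; in L2MAC it would be an LLM call.

```python
# Stored-program loop: a program counter walks a fixed instruction list,
# reading and writing an external store rather than the LLM context.
instructions = [
    "write module auth.py",
    "write module db.py",
    "write tests for auth.py",
]
memory: dict[str, str] = {}   # external memory, survives context resets

def execute(pc: int, instr: str, store: dict) -> None:
    # Stand-in for an LLM call that reads/writes the store.
    store[f"output_{pc}"] = f"result of: {instr}"

for pc, instr in enumerate(instructions):   # program counter over the
    execute(pc, instr, memory)              # stored instruction list
print(sorted(memory))
```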
- Low-code LLM: Graphical User Interface over Large Language Models [115.08718239772107]
This paper introduces a novel human-LLM interaction framework, Low-code LLM.
It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses.
We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability.
arXiv Detail & Related papers (2023-04-17T09:27:40Z)
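A loose sketch of the edit-then-execute pattern behind low-code interaction: the user toggles steps in a structured plan before each confirmed step becomes a controlled LLM call. The paper's six specific interaction types are not modeled here.

```python
# User-editable workflow; each confirmed step would become one LLM call.
workflow = [
    {"step": "outline the article", "skip": False},
    {"step": "draft each section",  "skip": False},
    {"step": "add citations",       "skip": True},   # user toggled this off
]

def run(flow):
    for node in flow:
        if node["skip"]:
            continue
        print("LLM executes:", node["step"])

run(workflow)
```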