Pie: A Programmable Serving System for Emerging LLM Applications
- URL: http://arxiv.org/abs/2510.24051v1
- Date: Tue, 28 Oct 2025 04:17:55 GMT
- Title: Pie: A Programmable Serving System for Emerging LLM Applications
- Authors: In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
- Abstract summary: Pie is a programmable serving system designed for flexibility and efficiency. It decomposes the traditional generation loop into fine-grained service handlers exposed via an API. It executes inferlets using WebAssembly, benefiting from its lightweight sandboxing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application and without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
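The abstract describes applications driving generation themselves through fine-grained handlers, including custom KV cache policies. Below is a minimal, hypothetical sketch of what an inferlet-style generation loop could look like; all names (`KVCache`, `forward_step`, `sample`, `inferlet`) are illustrative assumptions, not Pie's actual API.

```python
# Hypothetical sketch: an "inferlet" drives token generation itself by
# calling fine-grained serving handlers, instead of relying on a
# monolithic serve loop. Names and signatures are assumptions.

class KVCache:
    """Toy KV cache managed by the application, not the serving system."""
    def __init__(self):
        self.entries = []

    def append(self, token):
        self.entries.append(token)

    def trim(self, keep_last):
        # Application-specific cache policy (here: a sliding window).
        self.entries = self.entries[-keep_last:]

def forward_step(cache, token):
    # Stand-in for a serving-system handler running one model step;
    # a deterministic toy function replaces the real model.
    return (token * 31 + len(cache.entries)) % 100

def sample(logits):
    # Trivial "sampling" for the sketch.
    return logits

def inferlet(prompt_tokens, max_new=5, window=8):
    """User-provided generation logic: loop, sample, manage the cache."""
    cache = KVCache()
    for t in prompt_tokens:
        cache.append(t)
    out = []
    token = prompt_tokens[-1]
    for _ in range(max_new):
        token = sample(forward_step(cache, token))
        cache.append(token)
        cache.trim(window)  # bespoke KV-cache strategy, inside the app
        out.append(token)
    return out

print(inferlet([1, 2, 3]))
```

In a real system the handlers would be remote calls into the serving engine and the inferlet would run inside a WebAssembly sandbox; the point of the sketch is only the control inversion: the application, not the server, owns the loop.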
Related papers
- Learning to Share: Selective Memory for Efficient Parallel Agentic Systems [49.78267008828593]
Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. Recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. We propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks.
arXiv Detail & Related papers (2026-02-05T18:20:21Z) - Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms [33.64527903547734]
Harli improves finetune throughput by 46.2% on average over state-of-the-art serving systems.
arXiv Detail & Related papers (2025-11-13T05:58:52Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - Serve Programs, Not Prompts [1.285540133357144]
We propose a new large language model (LLM) serving system architecture that serves programs instead of prompts. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs.
arXiv Detail & Related papers (2025-10-29T11:29:03Z) - Justitia: Fair and Efficient Scheduling for LLM Applications [32.900257208449716]
We design Justitia, a novel scheduler with three key techniques. Justitia models the service cost of LLM applications in a memory-centric manner and uses a simple neural network model to conduct lightweight yet accurate demand prediction.
arXiv Detail & Related papers (2025-10-19T21:34:34Z) - Towards Agentic OS: An LLM Agent Framework for Linux Schedulers [3.8068085728995307]
We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our evaluation shows that SchedCP achieves up to a 1.79x performance improvement and a 13x cost reduction compared to naive agentic approaches.
arXiv Detail & Related papers (2025-09-01T08:38:49Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving [22.66354939370058]
Apt-Serve is a framework designed to enhance effective throughput in large language model (LLM) inference serving systems. A new hybrid cache scheme combines the KV cache with a memory-efficient hidden cache for reusable input hidden-state vectors, allowing larger batch sizes. Apt-Serve achieves up to an 8.8x improvement in effective throughput compared to state-of-the-art inference serving systems.
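The hybrid-cache idea above can be sketched with a toy admission policy: keep full KV entries while memory allows, and fall back to a compact hidden-state entry otherwise. The per-token sizes, class names, and policy below are assumptions for illustration, not Apt-Serve's actual design.

```python
# Illustrative hybrid cache in the spirit of Apt-Serve: full KV entries
# for some requests, plus a smaller "hidden cache" of reusable input
# hidden-state vectors. All sizes and the policy are assumptions.

KV_BYTES_PER_TOKEN = 1024      # assumed per-token KV footprint
HIDDEN_BYTES_PER_TOKEN = 128   # assumed per-token hidden-state footprint

class HybridCache:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.kv = {}       # request id -> token count (full KV)
        self.hidden = {}   # request id -> token count (compact)

    def used(self):
        return (sum(self.kv.values()) * KV_BYTES_PER_TOKEN
                + sum(self.hidden.values()) * HIDDEN_BYTES_PER_TOKEN)

    def admit(self, rid, tokens):
        """Admit with full KV if it fits, else fall back to the hidden cache."""
        if self.used() + tokens * KV_BYTES_PER_TOKEN <= self.budget:
            self.kv[rid] = tokens
            return "kv"
        if self.used() + tokens * HIDDEN_BYTES_PER_TOKEN <= self.budget:
            self.hidden[rid] = tokens
            return "hidden"
        return "rejected"

cache = HybridCache(budget_bytes=300 * 1024)
print(cache.admit("r1", 200))  # fits as full KV
print(cache.admit("r2", 200))  # falls back to the compact hidden cache
```

Because hidden-state entries are much smaller than KV entries, the fallback path is what lets many more requests stay resident, which is how a hybrid scheme can enable larger batch sizes.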
arXiv Detail & Related papers (2025-04-10T06:51:23Z) - ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving [61.35068981176018]
ConServe is a large language model (LLM) serving system that achieves high throughput and strong online latency guarantees. We show that ConServe delivers an average of 2.2x higher throughput and reduces online serving tail latency by 2.9x on average compared to state-of-the-art systems.
arXiv Detail & Related papers (2024-10-02T04:12:13Z) - Prompt Tuning as User Inherent Profile Inference Machine [68.16976932088708]
We propose UserIP-Tuning, which uses prompt-tuning to infer user profiles. UserIP-Tuning outperforms state-of-the-art recommendation algorithms. The presented solution has been deployed in Huawei AppGallery's Explore page since May 2025.
arXiv Detail & Related papers (2024-08-13T02:25:46Z) - Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs).
FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token.
We build a system prototype of FastServe; experimental results show that, compared to the state-of-the-art solution vLLM, FastServe improves throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
arXiv Detail & Related papers (2023-05-10T06:17:50Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.