VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
- URL: http://arxiv.org/abs/2509.04827v2
- Date: Sun, 14 Sep 2025 07:30:56 GMT
- Title: VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
- Authors: Jiahuan Yu, Aryan Taneja, Junfeng Lin, Minjia Zhang,
- Abstract summary: VoltanaLLM is a system for energy-efficient Large Language Model (LLM) serving. It co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures. It achieves up to 36.3% energy savings while maintaining near-perfect SLO attainment rate.
- Score: 13.494819588196371
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern Large Language Model (LLM) serving systems increasingly support interactive applications, like real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment. This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, built from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained phase-specific control. It consists of a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints. We implement VoltanaLLM in SGLang and evaluate its performance over multiple state-of-the-art LLMs and real-world datasets. The results demonstrate that VoltanaLLM achieves up to 36.3% energy savings while maintaining near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving. Code of VoltanaLLM is open-sourced on GitHub: https://github.com/Supercomputing-System-AI-Lab/VoltanaLLM.
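To make the control-theoretic design concrete, the sketch below illustrates the two mechanisms the abstract describes: a feedback loop that nudges the GPU frequency for each phase based on observed latency versus its SLO, and a router that picks the instance with the lowest predicted energy among those predicted to meet the SLO. The frequency table, the predict_latency/predict_energy callbacks, and the thresholds are illustrative assumptions, not VoltanaLLM's actual implementation.

```python
# Minimal sketch, under simplifying assumptions, of the two mechanisms the
# abstract describes; the frequency table, the predict_* callbacks, and the
# control thresholds are illustrative, not VoltanaLLM's actual implementation.

FREQS_MHZ = [900, 1200, 1500, 1800]  # hypothetical allowed GPU frequency steps

def adjust_frequency(current_freq, observed_latency, slo_latency, margin=0.9):
    """Feedback step: raise the frequency when latency approaches the SLO,
    lower it when there is ample headroom (a real controller would smooth this)."""
    i = FREQS_MHZ.index(current_freq)
    if observed_latency > margin * slo_latency and i + 1 < len(FREQS_MHZ):
        return FREQS_MHZ[i + 1]          # speed up to protect the SLO
    if observed_latency < 0.5 * slo_latency and i > 0:
        return FREQS_MHZ[i - 1]          # slow down to save energy
    return current_freq

def route(request, instances, predict_latency, predict_energy, slo_latency):
    """State-space routing step: among frequency-scaled instances, pick the one
    with the lowest predicted energy among those predicted to meet the SLO."""
    feasible = [inst for inst in instances
                if predict_latency(inst, request) <= slo_latency]
    candidates = feasible or instances   # best effort if no instance fits
    return min(candidates, key=lambda inst: predict_energy(inst, request))
```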
Related papers
- SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips [13.816966749411037]
We present SuperInfer, a high-performance Large Language Model (LLM) inference system designed for emerging Superchips (e.g., NVIDIA GH200). SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips. We show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining TBT and throughput comparable to state-of-the-art systems.
arXiv Detail & Related papers (2026-01-28T07:01:46Z) - Large Language Model (LLM)-enabled Reinforcement Learning for Wireless Network Optimization [79.27012080083603]
Large language models (LLMs) offer promising tools to enhance reinforcement learning in wireless networks. We propose an LLM-assisted state representation and semantic extraction to enhance the multi-agent reinforcement learning framework.
arXiv Detail & Related papers (2026-01-15T01:42:39Z) - Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation [52.11339614452127]
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. We propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner.
arXiv Detail & Related papers (2025-11-18T04:32:00Z) - BeLLMan: Controlling LLM Congestion [1.6728793271113227]
Large language model (LLM) applications are blindfolded to the infrastructure underneath and generate tokens autoregressively, indifferent to the system load. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust the output length in response to changing system load.
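As an illustration of the idea (not beLLMan's actual interface), the snippet below sketches how an application-side output-length budget could shrink as a normalized load signal from the serving infrastructure rises; the scaling rule and thresholds are assumptions.

```python
# Minimal sketch (not beLLMan's interface): shrink the application-side output
# length budget as a normalized load signal from the serving side rises.
# The scaling rule and thresholds are assumptions.

def choose_max_tokens(base_max_tokens: int, system_load: float,
                      floor_tokens: int = 64) -> int:
    """Map a load signal in [0, 1] to a max-output-length budget: the full
    budget under light load, a progressively smaller one under heavy load."""
    load = min(max(system_load, 0.0), 1.0)
    budget = int(base_max_tokens * (1.0 - 0.75 * load))
    return max(budget, floor_tokens)

# Example: at 80% load a 1024-token budget shrinks to 409 tokens.
print(choose_max_tokens(1024, 0.8))
```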
arXiv Detail & Related papers (2025-10-17T05:36:42Z) - EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering [55.56674028743782]
Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM.
arXiv Detail & Related papers (2025-09-29T17:59:07Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts [18.479200918676575]
Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services. However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and privacy concerns. This paper proposes a novel deep reinforcement learning-based routing framework for sustained high-quality LLM services.
arXiv Detail & Related papers (2025-08-01T00:45:15Z) - Energy-Aware LLMs: A step towards sustainable AI for downstream applications [0.9012198585960441]
Advanced Large Language Models (LLMs) have revolutionized various fields, including communication networks. LLMs typically require huge computational resources, resulting in very high energy consumption. This study proposes an end-to-end pipeline that investigates the trade-off between energy efficiency and model performance.
arXiv Detail & Related papers (2025-03-22T14:28:29Z) - AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding [12.106234303559571]
We present AdaServe, the first serving system designed to support efficient multi-SLO serving through SLO-customized speculative decoding. AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm. It features a speculate-select-verify pipeline that enables fine-grained control over decoding speed while maximizing system throughput.
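For intuition only, the sketch below approximates the kind of per-request knob such a pipeline tunes: picking a draft (speculation) length so that the expected latency per emitted token stays within the request's SLO. The cost and acceptance models are placeholder assumptions, not AdaServe's algorithm.

```python
# Minimal sketch (placeholder models, not AdaServe's algorithm): choose the
# largest draft length whose expected per-token latency still meets the
# request's SLO, using a simple geometric acceptance-rate model.

def pick_draft_length(slo_ms_per_token, draft_ms, verify_ms, accept_rate,
                      max_draft=8):
    """Expected tokens per step ~ 1 + a + a^2 + ... + a^k (the verifier emits
    at least one token); draft_ms is per drafted token, verify_ms per step."""
    best = 1
    for k in range(1, max_draft + 1):
        expected_tokens = sum(accept_rate ** i for i in range(0, k + 1))
        ms_per_token = (k * draft_ms + verify_ms) / expected_tokens
        if ms_per_token <= slo_ms_per_token:
            best = k
    return best
```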
arXiv Detail & Related papers (2025-01-21T14:15:01Z) - DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in the LLM's computational cost (5.2-6.5x) and GPU memory footprint (2-6x) without compromising performance.
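As a rough illustration (not DeeR's architecture), the snippet below shows the generic early-exit pattern the summary alludes to: run blocks sequentially and stop once a per-layer exit head is confident enough, so easy inputs consume fewer layers; module names and the threshold are assumptions.

```python
# Minimal sketch of the generic early-exit pattern (not DeeR's architecture):
# run blocks sequentially and stop once a per-layer exit head is confident
# enough, so easier inputs use fewer layers. Names/threshold are assumptions.
import torch

def forward_with_early_exit(blocks, exit_heads, x, threshold=0.9):
    """blocks/exit_heads: equal-length lists of modules; assumes >= 1 block.
    Returns (logits, number_of_layers_used)."""
    h, logits = x, None
    for i, (block, head) in enumerate(zip(blocks, exit_heads), start=1):
        h = block(h)
        logits = head(h)
        confidence = torch.softmax(logits, dim=-1).max(dim=-1).values
        if confidence.min() >= threshold:  # every item in the batch is confident
            return logits, i
    return logits, len(blocks)
```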
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z) - Queue Management for SLO-Oriented Large Language Model Serving [3.0134961904579094]
We propose QLM, a queue management system for large language model (LLM) serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. It uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue.
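For intuition (not QLM's actual estimator), a waiting-time estimate of this kind can be sketched by summing the predicted service times of requests queued ahead of a given position and dividing by the number of concurrent serving slots; the names and the service-time inputs below are assumptions.

```python
# Minimal sketch of a waiting-time estimate of this kind (not QLM's estimator):
# sum predicted service times of requests queued ahead of a position and divide
# by the number of concurrent serving slots. All names/models are assumptions.
from typing import List

def estimate_waiting_time(queued_service_times_s: List[float],
                          position: int, num_slots: int) -> float:
    """Rough wait for the request at `position` (0-based) in the queue,
    assuming `num_slots` requests are served concurrently."""
    ahead = queued_service_times_s[:position]
    return sum(ahead) / max(num_slots, 1)

# Example: third in queue, two concurrent slots -> 2.75 seconds.
print(estimate_waiting_time([2.0, 3.5, 1.0, 4.0], position=2, num_slots=2))
```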
arXiv Detail & Related papers (2024-06-05T21:17:34Z) - Function Approximation for Reinforcement Learning Controller for Energy from Spread Waves [69.9104427437916]
Multi-generator Wave Energy Converters (WEC) must handle multiple simultaneous waves coming from different directions, called spread waves.
These complex devices need controllers with multiple objectives of energy capture efficiency, reduction of structural stress to limit maintenance, and proactive protection against high waves.
In this paper, we explore different function approximations for the policy and critic networks in modeling the sequential nature of the system dynamics.
arXiv Detail & Related papers (2024-04-17T02:04:10Z) - Communication-Efficient Orchestrations for URLLC Service via Hierarchical Reinforcement Learning [14.604814002402588]
We propose a multi-agent Hierarchical RL (HRL) framework that enables the implementation of multi-level policies with different control loop timescales.
On a use case from the prior art, with our HRL framework, we optimized the maximum number of retransmissions and transmission power of industrial devices.
arXiv Detail & Related papers (2023-07-25T11:23:38Z) - Power and Interference Control for VLC-Based UDN: A Reinforcement Learning Approach [10.576175218005046]
Ultra-dense network (UDN) technology can be adopted to expand the capacity of Visible Light Communication (VLC) networks.
The deployment of the cells is optimized via spatial reuse to mitigate ICI.
An RL-based algorithm is proposed to dynamically optimize the policy of power and interference control.
arXiv Detail & Related papers (2023-03-09T17:46:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.