Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents
- URL: http://arxiv.org/abs/2512.14142v1
- Date: Tue, 16 Dec 2025 06:55:10 GMT
- Title: Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents
- Authors: Hongqiu Ni, Jiabao Zhang, Guopeng Li, Zilong Wang, Ruiqi Wu, Chi Zhang, Haisheng Tan,
- Abstract summary: Astraea is a service engine designed to shift the optimization from local segments to the global request lifecycle. It employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. Astraea reduces average JCT by up to 25.5% compared to baseline methods.
- Score: 12.884297990127985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services such as Web APIs, introduce a mismatch between their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization, which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. It dynamically classifies requests by their I/O- or compute-intensive nature and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles agent state during I/O waits based on system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
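The enhanced HRRN (Highest Response Ratio Next) policy is the core of the scheduler described above. Classic HRRN ranks jobs by (waiting time + expected service time) / expected service time, favoring short jobs while preventing starvation of long ones. The sketch below shows how such a policy might fold in the abstract's I/O-vs-compute classification; the class names, the `io_boost` knob, and the 0.5 threshold are illustrative assumptions, not Astraea's actual implementation.

```python
import time
from dataclasses import dataclass

@dataclass
class AgentRequest:
    req_id: str
    arrival: float        # arrival timestamp in seconds
    est_service: float    # predicted remaining service time in seconds
    io_fraction: float    # predicted fraction of lifecycle spent in external I/O

    def response_ratio(self, now: float) -> float:
        # Classic HRRN: (waiting + service) / service. The ratio grows as a
        # request waits, so long jobs cannot starve behind short ones.
        waiting = now - self.arrival
        return (waiting + self.est_service) / self.est_service

def pick_next(pending: list[AgentRequest], io_boost: float = 1.2) -> AgentRequest:
    """Pick the request with the highest class-weighted response ratio.

    io_boost is a hypothetical knob: requests classified as I/O-intensive
    get a small priority bump so their external calls are issued early and
    overlap with compute-bound work.
    """
    now = time.monotonic()

    def priority(r: AgentRequest) -> float:
        ratio = r.response_ratio(now)
        return ratio * io_boost if r.io_fraction > 0.5 else ratio

    return max(pending, key=priority)
```

The key property is that priority is relative: a compute-bound request that has waited long enough still overtakes a freshly arrived I/O-bound one, which is how HRRN-style policies trade throughput against fairness.

The adaptive KV cache manager faces a three-way choice whenever a request parks on an external call: keep its KV blocks on the GPU, offload them to host memory, or discard them and recompute on resume. A minimal decision sketch follows, with invented thresholds and cost inputs; the abstract only states that the choice depends on memory pressure.

```python
from enum import Enum

class CachePolicy(Enum):
    RETAIN = "retain"      # keep KV blocks on GPU during the I/O wait
    OFFLOAD = "offload"    # copy KV blocks to host memory, free GPU blocks
    DISCARD = "discard"    # drop the cache and recompute on resume

def on_io_wait(memory_pressure: float, transfer_cost_s: float,
               recompute_cost_s: float) -> CachePolicy:
    """Decide what to do with a request's KV cache when it blocks on an
    external call. The 0.7 threshold and cost inputs are illustrative."""
    if memory_pressure < 0.7:
        return CachePolicy.RETAIN        # headroom available: keep it hot
    if transfer_cost_s < recompute_cost_s:
        return CachePolicy.OFFLOAD       # paying PCIe transfer beats recompute
    return CachePolicy.DISCARD           # recompute is cheaper than round-tripping
```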
Related papers
- AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering [52.67783579040657]
AceGRPO is a machine learning system that prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines.
arXiv Detail & Related papers (2026-02-08T10:55:03Z)
- Asynchronous MultiAgent Reinforcement Learning for 5G Routing under Side Constraints [1.0732935873226022]
We propose an asynchronous multi-agent reinforcement learning framework in which independent PPO agents plan routes in parallel and commit resource deltas to a shared global resource environment. We evaluate the method on an O-RAN-like network simulation using near-real-time traffic data from the city of Montreal. AMARL achieves a similar Grade of Service (GoS) and end-to-end latency, with reduced training wall-clock time and improved robustness to demand shifts. (A minimal sketch of the delta-commit idea follows this entry.)
arXiv Detail & Related papers (2026-01-18T18:38:37Z)
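The phrase "commit resource deltas to a shared global resource environment" suggests a transactional ledger shared by the asynchronous agents. Below is a minimal sketch under that reading; the class name, locking scheme, and per-link capacity model are assumptions for illustration, not details from the paper.

```python
import threading

class SharedResourceLedger:
    """Hypothetical shared environment that parallel agents commit resource
    deltas to; commits are atomic and rejected when they would violate a
    capacity side constraint on any link."""

    def __init__(self, capacity: dict[str, float]):
        self._lock = threading.Lock()
        self._remaining = dict(capacity)

    def try_commit(self, deltas: dict[str, float]) -> bool:
        with self._lock:
            # Validate the entire delta set before applying any of it,
            # so a rejected commit leaves the ledger untouched.
            for link, used in deltas.items():
                if self._remaining.get(link, 0.0) - used < 0.0:
                    return False
            for link, used in deltas.items():
                self._remaining[link] -= used
            return True
```

An agent whose commit is rejected would replan its route against the updated ledger, which keeps agents consistent without synchronizing their training steps.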
- Towards Efficient Agents: A Co-Design of Inference Architecture and System [66.59916327634639]
This paper presents AgentInfer, a unified framework for end-to-end agent acceleration. We decompose the problem into four synergistic components: AgentCollab, AgentSched, AgentSAM, and AgentCompress. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%.
arXiv Detail & Related papers (2025-12-20T12:06:13Z)
- Deep Q-Learning-Based Intelligent Scheduling for ETL Optimization in Heterogeneous Data Environments [10.31577390735368]
This paper proposes an intelligent scheduling optimization framework based on deep Q-learning. The framework formalizes the scheduling process as a Markov Decision Process. It enables adaptive decision-making by a reinforcement learning agent in high-dimensional state spaces. (A minimal Q-learning sketch follows this entry.)
arXiv Detail & Related papers (2025-12-15T07:38:47Z)
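As a reference point for the MDP formulation, here is the tabular Q-learning update that deep Q-learning approximates with a network; the scheduling-specific state and action encodings are the paper's contribution and are not reproduced here.

```python
import numpy as np

def q_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
             alpha: float = 0.1, gamma: float = 0.99) -> None:
    """One tabular Q-learning step: Q[s, a] moves toward the Bellman
    target r + gamma * max_a' Q[s', a']. A DQN replaces the table with
    a neural network over high-dimensional states."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```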
- xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework. xLLM builds a novel decoupled service-engine architecture. xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z)
- Slim Scheduler: A Runtime-Aware RL and Scheduler System for Efficient CNN Inference [0.0]
Slim Scheduler integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. This hierarchical design reduces search space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput.
arXiv Detail & Related papers (2025-10-10T05:44:05Z)
- CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z)
- Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. (A pass@k sketch follows this entry.)
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
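For reference, the pass@k metric quoted for DARS is conventionally computed with the unbiased estimator from Chen et al. (2021): given n samples of which c pass, pass@k = 1 - C(n-c, k)/C(n, k). Whether DARS uses exactly this estimator is not stated in the summary; the direct implementation below is the standard one.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given n total
    samples of which c are correct (unbiased estimator)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```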
- A Bayesian Framework of Deep Reinforcement Learning for Joint O-RAN/MEC Orchestration [12.914011030970814]
Multi-access Edge Computing (MEC) can be implemented together with Open Radio Access Network (O-RAN) over commodity platforms to offer low-cost deployment. In this paper, a joint O-RAN/MEC orchestration using a Bayesian deep reinforcement learning (RL)-based framework is proposed.
arXiv Detail & Related papers (2023-12-26T18:04:49Z)
- A Deep Recurrent-Reinforcement Learning Method for Intelligent AutoScaling of Serverless Functions [18.36339203254509]
Function-as-a-Service (FaaS) introduces a lightweight, function-based cloud execution model relevant to a range of applications such as IoT-edge data processing and anomaly detection.
arXiv Detail & Related papers (2023-08-11T04:41:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.