Related papers: Adaptive Orchestration for Large-Scale Inference on Heterogeneous Accelerator Systems Balancing Cost, Performance, and Resilience

URL: http://arxiv.org/abs/2503.20074v2
Date: Thu, 27 Mar 2025 17:16:44 GMT
Title: Adaptive Orchestration for Large-Scale Inference on Heterogeneous Accelerator Systems Balancing Cost, Performance, and Resilience
Authors: Yahav Biran, Imry Kissos
Abstract summary: This paper proposes a hardware-agnostic control loop that adaptively allocates requests across heterogeneous accelerators. The framework consistently meets latency targets, automatically redirects traffic during capacity shortfalls, and capitalizes on lower-cost accelerators.
Score: 0.46040036610482665
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The surge in generative AI workloads has created a need for scalable inference systems that can flexibly harness both GPUs and specialized accelerators while containing operational costs. This paper proposes a hardware-agnostic control loop that adaptively allocates requests across heterogeneous accelerators based on real-time cost and capacity signals. The approach sustains low latency and high throughput by dynamically shifting between cost-optimized and capacity-optimized modes, ensuring the most efficient use of expensive compute resources under fluctuating availability. Evaluated using the Stable Diffusion model, the framework consistently meets latency targets, automatically redirects traffic during capacity shortfalls, and capitalizes on lower-cost accelerators when possible. These results highlight how a feedback-driven deployment strategy, spanning the entire software and hardware stack, can help organizations efficiently scale generative AI workloads while maintaining resilience in the face of limited accelerator capacity.
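The allocation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Pool` type, the `allocate` function, the pool names, the cost and capacity numbers, and the rule for switching modes are all hypothetical, chosen only to show how cost and capacity signals might drive a cheapest-first split that spills over when a pool runs short.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str            # hypothetical accelerator pool label
    cost_per_req: float  # assumed cost signal, arbitrary units per request
    capacity: int        # assumed capacity signal, requests/s available now


def allocate(pools, demand):
    """Greedy cheapest-first split of `demand` (requests/s) across pools.

    Returns (mode, plan, unserved). The mode rule is an illustrative
    assumption, not taken from the paper.
    """
    total = sum(p.capacity for p in pools)
    # Assumed rule: switch to capacity-optimized once demand meets or
    # exceeds everything the pools can serve right now.
    mode = "capacity-optimized" if demand >= total else "cost-optimized"
    plan = {}
    remaining = demand
    # Fill the cheapest pool first, then spill over to pricier ones.
    for p in sorted(pools, key=lambda p: p.cost_per_req):
        share = min(p.capacity, remaining)
        plan[p.name] = share
        remaining -= share
    return mode, plan, remaining


pools = [Pool("gpu", 3.0, 100), Pool("asic", 1.0, 80)]
print(allocate(pools, 120))

# Simulated capacity shortfall on the cheaper pool: traffic shifts to the other.
pools[1].capacity = 20
print(allocate(pools, 120))
```

In a real control loop this function would presumably be re-run on every tick with fresh signals; here it is called twice by hand to show traffic moving away from the cheaper pool once its capacity drops.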

Related papers

EvoRoute: Experience-Driven Self-Routing LLM Agent Systems [100.64399490164959]
EvoRoute is a self-evolving model routing paradigm that transcends static, pre-defined model assignments. Experiments on challenging agentic benchmarks demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to 80% and latency by over 70%.
arXiv Detail & Related papers (2026-01-06T04:06:46Z)
AgentEvolver: Towards Efficient Self-Evolving Agent System [51.54882384204726]
We present AgentEvolver, a self-evolving agent system that drives autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: self-questioning, self-navigating, and self-attributing. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
arXiv Detail & Related papers (2025-11-13T15:14:47Z)
Dynamic Speculative Agent Planning [57.630218933994534]
Large language-model-based agents face critical deployment challenges due to prohibitive latency and inference costs. We introduce Dynamic Speculative Planning (DSP), an online reinforcement learning framework that provides lossless acceleration with substantially reduced costs. Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest acceleration method while reducing total cost by 30% and unnecessary cost up to 60%.
arXiv Detail & Related papers (2025-09-02T03:34:36Z)
PowerGrow: Feasible Co-Growth of Structures and Dynamics for Power Grid Synthesis [75.14189839277928]
We present PowerGrow, a co-generative framework that significantly reduces computational overhead while maintaining operational validity. Experiments across benchmark settings show that PowerGrow outperforms prior diffusion models in fidelity and diversity. This demonstrates its ability to generate operationally valid and realistic power grid scenarios.
arXiv Detail & Related papers (2025-08-29T01:47:27Z)
The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective [3.0868637098088403]
Large-language-model (LLM)-based AI agents have recently showcased impressive versatility by employing dynamic reasoning. This paper presents the first comprehensive system-level analysis of AI agents, quantifying their resource usage, latency behavior, energy consumption, and test-time scaling strategies. Our findings reveal that while agents improve accuracy with increased compute, they suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs.
arXiv Detail & Related papers (2025-06-04T14:37:54Z)
Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling [1.3689475854650441]
This study proposes a comprehensive scalability optimization framework for cloud AI inference services. The proposed model is a hybrid approach that combines reinforcement learning for adaptive load distribution and deep neural networks for accurate demand forecasting. Experimental results demonstrate that the proposed model enhances load balancing efficiency by 35% and reduces response delay by 28%.
arXiv Detail & Related papers (2025-04-16T04:00:04Z)
Intelligent Sensing-to-Action for Robust Autonomy at the Edge: Opportunities and Challenges [19.390215975410406]
Autonomous edge computing in robotics, smart cities, and autonomous vehicles relies on seamless integration of sensing, processing, and actuation. At its core is the sensing-to-action loop, which iteratively aligns sensor inputs with computational models to drive adaptive control strategies. This article explores how proactive, context-aware sensing-to-action and action-to-sensing adaptations can enhance efficiency.
arXiv Detail & Related papers (2025-02-04T20:13:58Z)
Neural Horizon Model Predictive Control -- Increasing Computational Efficiency with Neural Networks [0.0]
We propose a machine-learning-supported approach to model predictive control that approximates part of the problem horizon while maintaining safety guarantees. The proposed MPC scheme can be applied to a wide range of applications, including those requiring a rapid control response.
arXiv Detail & Related papers (2024-08-19T08:13:37Z)
Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance. Our method benefits from less cost during inference while keeping the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z)
Exploration of Activation Fault Reliability in Quantized Systolic Array-Based DNN Accelerators [0.8796261172196743]
This paper presents a comprehensive methodology for exploring and enabling a holistic assessment of the impact of quantization on model accuracy, activation fault reliability, and hardware efficiency. A fully automated framework is introduced that is capable of applying various quantization-aware techniques, fault injection, and hardware implementation. The experiments on established benchmarks demonstrate the analysis flow and the profound implications of quantization on reliability, hardware performance, and network accuracy.
arXiv Detail & Related papers (2024-01-17T12:55:17Z)
Real-time Control of Electric Autonomous Mobility-on-Demand Systems via Graph Reinforcement Learning [14.073588678179865]
Electric Autonomous Mobility-on-Demand (E-AMoD) fleets need to make several real-time decisions. We present the E-AMoD control problem through the lens of reinforcement learning. We propose a graph network-based framework to achieve drastically improved scalability and superior performance overoptimals.
arXiv Detail & Related papers (2023-11-09T22:57:21Z)
Multi-Objective Optimization for UAV Swarm-Assisted IoT with Virtual Antenna Arrays [55.736718475856726]
Unmanned aerial vehicle (UAV) networks are a promising technology for assisting the Internet of Things (IoT). Existing UAV-assisted data harvesting and dissemination schemes require UAVs to frequently fly between the IoT devices and access points. We introduce collaborative beamforming into IoT devices and UAVs simultaneously to achieve energy- and time-efficient data harvesting and dissemination.
arXiv Detail & Related papers (2023-08-03T02:49:50Z)
Elastic Entangled Pair and Qubit Resource Management in Quantum Cloud Computing [73.7522199491117]
Quantum cloud computing (QCC) offers a promising approach to efficiently provide quantum computing resources. The fluctuations in user demand and quantum circuit requirements are challenging for efficient resource provisioning. We propose a resource allocation model to provision quantum computing and networking resources.
arXiv Detail & Related papers (2023-07-25T00:38:46Z)
Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption. Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy. We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
Guaranteed Dynamic Scheduling of Ultra-Reliable Low-Latency Traffic via Conformal Prediction [72.59079526765487]
The dynamic scheduling of ultra-reliable and low-latency traffic (URLLC) in the uplink can significantly enhance the efficiency of coexisting services. The main challenge is posed by the uncertainty in the process of URLLC packet generation. We introduce a novel scheduler for URLLC packets that provides formal guarantees on reliability and latency irrespective of the quality of the URLLC traffic predictor.
arXiv Detail & Related papers (2023-02-15T14:09:55Z)
Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitude faster. By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.