Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems
- URL: http://arxiv.org/abs/2507.14715v1
- Date: Sat, 19 Jul 2025 18:24:11 GMT
- Title: Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems
- Authors: Rachid Karami, Rajeev Patwari, Hyoukjun Kwon, Ashish Sirasao,
- Abstract summary: Real-time generative AI (RTGen) workloads combine the compute intensity and dynamic execution patterns of generative models with the stringent latency and concurrency constraints of real-time inference. Modern edge platforms increasingly adopt heterogeneous system-on-chip (SoC) architectures. We show that scheduling decisions significantly affect workload performance.
- Score: 0.9041154551329587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of generative AI models, particularly large language models (LLMs), into real-time multi-model AI applications such as video conferencing and gaming is giving rise to a new class of workloads: real-time generative AI (RTGen). These workloads combine the compute intensity and dynamic execution patterns of generative models with the stringent latency and concurrency constraints of real-time inference. To meet the diverse demands of RTGen workloads, modern edge platforms increasingly adopt heterogeneous system-on-chip (SoC) architectures that integrate CPUs, GPUs, and NPUs. Despite the potential of heterogeneous SoC, the scheduling space complexity and performance implications of RTGen workloads on such platforms remain underexplored. In this work, we perform a comprehensive characterization of RTGen workloads on AMD's latest heterogeneous SoC, Ryzen AI. We construct realistic multi-model scenarios inspired by industry use cases and profile model performance across all available backends. Using this data, we evaluate five scheduling policies and their impact on both real-time metrics (e.g., deadline violation rate) and LLM performance (e.g., time-to-first-token and tokens-per-second). Our results show that scheduling decisions significantly affect workload performance (e.g., leading to a 41.7% difference in deadline violation rates on average), and highlight the need for scheduling strategies that are aware of workload dynamics and hardware heterogeneity. Our findings underscore the importance of workload-aware, dynamic heterogeneous scheduling in enabling high-performance, on-device RTGen applications.
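The abstract contrasts scheduling policies by their effect on deadline violation rates across CPU/GPU/NPU backends. As a minimal sketch of why the policy choice matters, the toy simulation below compares a naive "always use the fastest backend" policy against a load-aware one; all model names, latencies, and policy names are illustrative assumptions, not measurements or policies from the paper.

```python
# Toy simulation of backend selection on a heterogeneous SoC.
# The latency table and both policies are hypothetical, for illustration only.

LATENCY_MS = {  # (model, backend) -> assumed standalone runtime in ms
    ("detector", "NPU"): 4.0,
    ("detector", "GPU"): 6.0,
    ("detector", "CPU"): 15.0,
}

def violation_rate(tasks, policy):
    """tasks: list of (model, release_ms, deadline_ms); each backend serves one task at a time."""
    busy_until = {"CPU": 0.0, "GPU": 0.0, "NPU": 0.0}
    violations = 0
    for model, release, deadline in tasks:
        options = [(b, t) for (m, b), t in LATENCY_MS.items() if m == model]
        if policy == "fastest_backend":
            # Always pick the backend with the lowest standalone latency,
            # ignoring queueing delay.
            backend, runtime = min(options, key=lambda o: o[1])
        else:  # "least_loaded": pick the earliest predicted finish time
            backend, runtime = min(
                options, key=lambda o: max(busy_until[o[0]], release) + o[1]
            )
        start = max(busy_until[backend], release)
        busy_until[backend] = start + runtime
        if start + runtime > deadline:
            violations += 1
    return violations / len(tasks)
```

With two concurrent "detector" requests, `fastest_backend` serializes both on the NPU and misses the second deadline, while `least_loaded` offloads the second request to the GPU, mirroring the kind of policy-dependent gap in violation rates the paper characterizes.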
Related papers
- Efficient and Scalable Agentic AI with Heterogeneous Systems [1.8921715645847679]
AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the benefits of AI to enterprises and consumers. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. We present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure.
arXiv Detail & Related papers (2025-07-25T19:02:42Z) - Context-Aware CodeLLM Eviction for AI-assisted Coding [6.199193051670653]
AI-assisted coding tools powered by Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development. To address concerns around privacy, latency, and model customization, many enterprises opt to self-host these models. This paper presents CACE, a novel context-aware model eviction strategy designed specifically to optimize self-hosted CodeLLM serving under resource constraints.
arXiv Detail & Related papers (2025-06-23T16:03:32Z) - Understanding and Optimizing Multi-Stage AI Inference Pipelines [11.254219071373319]
HERMES is a Heterogeneous Multi-stage LLM inference Execution Simulator. Unlike prior frameworks, HERMES supports heterogeneous clients executing multiple models concurrently. We explore the impact of reasoning stages on end-to-end latency, optimal strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval.
arXiv Detail & Related papers (2025-04-14T00:29:49Z) - DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
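DeeR's core idea is dynamic early exit: stop running further model layers once an intermediate result is already good enough. A minimal sketch of that control flow, with toy callables standing in for MLLM blocks and exit heads (the function names and the confidence-threshold criterion are assumptions for illustration, not DeeR's actual exit rule):

```python
def early_exit_inference(x, blocks, exit_heads, threshold=0.9):
    """Run blocks in order; return as soon as an intermediate exit head
    is confident enough. Returns (prediction, number_of_blocks_used)."""
    label = None
    for depth, (block, head) in enumerate(zip(blocks, exit_heads), start=1):
        x = block(x)
        conf, label = head(x)  # each head yields (confidence, prediction)
        if conf >= threshold:
            break  # skip the remaining (more expensive) blocks
    return label, depth
```

Compute saved is proportional to the blocks skipped, which is how an early-exit scheme can cut activated model size without retraining the full stack.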
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Profiling AI Models: Towards Efficient Computation Offloading in Heterogeneous Edge AI Systems [0.2357055571094446]
We propose a research roadmap focused on profiling AI models, capturing data about model types and underlying hardware to predict resource utilisation and task completion time.
Experiments with over 3,000 runs show promise in optimising resource allocation and enhancing Edge AI performance.
arXiv Detail & Related papers (2024-10-30T16:07:14Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - TimeGraphs: Graph-based Temporal Reasoning [64.18083371645956]
TimeGraphs is a novel approach that characterizes dynamic interactions as a hierarchical temporal graph.
Our approach models the interactions using a compact graph-based representation, enabling adaptive reasoning across diverse time scales.
We evaluate TimeGraphs on multiple datasets with complex, dynamic agent interactions, including a football simulator, the Resistance game, and the MOMA human activity dataset.
arXiv Detail & Related papers (2024-01-06T06:26:49Z) - Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on edge devices.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
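One way sparsity information can feed a multi-DNN scheduler is by discounting each job's dense latency estimate with its observed activation sparsity, then prioritizing by slack. The sketch below illustrates that idea only; the cost model and the least-slack rule are assumptions for illustration, not Dysta's actual scoring mechanism.

```python
def pick_next(jobs, now):
    """jobs: list of dicts with 'dense_ms', 'sparsity', 'deadline_ms'.
    Estimate remaining latency by scaling the dense-execution latency
    by the fraction of non-zero work, then pick the job with least slack."""
    def slack(job):
        est_ms = job["dense_ms"] * (1.0 - job["sparsity"])  # sparsity-aware estimate
        return (job["deadline_ms"] - now) - est_ms
    return min(jobs, key=slack)
```

A sparsity-oblivious scheduler would treat a highly sparse job as urgent based on its dense latency; the sparsity-aware estimate reveals it has plenty of slack and lets a genuinely tight job run first.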
arXiv Detail & Related papers (2023-10-17T09:25:17Z) - Generative Modeling of Regular and Irregular Time Series Data via Koopman VAEs [50.25683648762602]
We introduce Koopman VAE, a new generative framework that is based on a novel design for the model prior.
Inspired by Koopman theory, we represent the latent conditional prior dynamics using a linear map.
KoVAE outperforms state-of-the-art GAN and VAE methods across several challenging synthetic and real-world time series generation benchmarks.
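The Koopman-inspired prior replaces a nonlinear latent transition with a single linear map, so sampling a latent trajectory reduces to repeated matrix multiplication, z_{t+1} = A z_t. A minimal noiseless rollout sketch (the function name is an assumption; the trained model would add a decoder and learned noise):

```python
import numpy as np

def rollout_linear_prior(A, z0, steps):
    """Roll the latent state forward under the linear map z_{t+1} = A @ z_t."""
    zs = [np.asarray(z0, dtype=float)]
    for _ in range(steps):
        zs.append(A @ zs[-1])
    return np.stack(zs)  # shape (steps + 1, latent_dim)
```

Because the dynamics are linear, stability and long-horizon behavior can be read directly off the eigenvalues of A, which is a key attraction of Koopman-style latent models.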
arXiv Detail & Related papers (2023-10-04T07:14:43Z) - DREAM: A Dynamic Scheduler for Dynamic Real-time Multi-model ML Workloads [8.266680870089997]
We propose a new scheduler, DREAM, which effectively handles various dynamicity in RTMM workloads.
DREAM quantifies the unique requirements for RTMM workloads and utilizes the scores quantified to drive scheduling decisions.
In our evaluation of five RTMM workload scenarios, DREAM reduces the overall UXCost by 32.2% and 50.0% in the geometric mean (up to 80.8% and 97.6%) compared to state-of-the-art baselines.
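Geometric-mean reductions like those above are computed from per-scenario cost ratios rather than a plain average. A small sketch of that arithmetic (the helper names are assumptions; the input ratios are illustrative, not DREAM's measured numbers):

```python
import math

def geomean(xs):
    """Geometric mean via the mean of logs (numerically safer than a product)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def mean_reduction_pct(cost_ratios):
    """cost_ratios: per-scenario UXCost(new) / UXCost(baseline).
    Returns the geometric-mean percentage reduction."""
    return 100.0 * (1.0 - geomean(cost_ratios))
```

The geometric mean is standard for ratio metrics because it is symmetric under inversion: a 2x improvement and a 2x regression cancel, which an arithmetic mean of ratios would not.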
arXiv Detail & Related papers (2022-12-07T02:48:14Z) - A Generative Approach for Production-Aware Industrial Network Traffic Modeling [70.46446906513677]
We investigate the network traffic data generated from a laser cutting machine deployed in a Trumpf factory in Germany.
We analyze the traffic statistics, capture the dependencies between the internal states of the machine, and model the network traffic as a production state dependent process.
We compare the performance of various generative models including variational autoencoder (VAE), conditional variational autoencoder (CVAE), and generative adversarial network (GAN).
arXiv Detail & Related papers (2022-11-11T09:46:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.