Efficient and Scalable Agentic AI with Heterogeneous Systems
- URL: http://arxiv.org/abs/2507.19635v1
- Date: Fri, 25 Jul 2025 19:02:42 GMT
- Title: Efficient and Scalable Agentic AI with Heterogeneous Systems
- Authors: Zain Asgar, Michelle Nguyen, Sachin Katti,
- Abstract summary: AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the promised benefits of AI to enterprises and consumers.<n>To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure.<n>We present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure.
- Score: 1.8921715645847679
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the promised benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. Often these agents are directed graphs of compute and IO operations that span multi-modal data input and conversion), data processing and context gathering (e.g vector DB lookups), multiple LLM inferences, tool calls, etc. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. To tackle this challenge, in this paper, we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for compute, memory, and bandwidth constraints of different HW; a MLIR based representation and compilation system that can decompose AI agent execution graphs into granular operators and generate code for different HW options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end-to-end SLA. Our design performs a systems level TCO optimization and preliminary results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits. A preliminary surprising finding is that for some workloads a heterogeneous combination of older generation GPUs with newer accelerators can deliver similar TCO as the latest generation homogenous GPU infrastructure design, potentially extending the life of deployed infrastructure.
Related papers
- Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation [72.44384066166147]
Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains.<n>Existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures.<n>We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch.
arXiv Detail & Related papers (2025-07-24T09:17:41Z) - Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems [0.9041154551329587]
Real-time generative AI (RTGen) workloads combine the compute intensity and dynamic execution patterns of generative models with the stringent latency and constraints of real-time inference.<n>Modern edge platforms increasingly adopt heterogeneous system-on-chip (SoC) architectures.<n>We show that scheduling decisions significantly affect workload performance.
arXiv Detail & Related papers (2025-07-19T18:24:11Z) - Beyond Connectivity: An Open Architecture for AI-RAN Convergence in 6G [20.07205081315289]
This article presents a novel converged O-RAN and AI-RAN architecture that unifies orchestration and management of both telecommunications and AI workloads on shared infrastructure.<n>We introduce two key architectural innovations: (i) the AI-RAN Orchestrator, which extends the O-RAN Service Management and Orchestration (SMO) to enable integrated resource and allocation across RAN and AI workloads; and (ii) AI-RAN sites that provide distributed edge AI platforms with real-time processing capabilities.
arXiv Detail & Related papers (2025-07-09T14:49:11Z) - Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities [117.49715661395294]
Data structurization can play a promising role by transforming intricate and disorganized data into well-structured forms.<n>This survey presents a first systematic review of how graphs can empower AI agents.
arXiv Detail & Related papers (2025-06-22T12:59:12Z) - KAITIAN: A Unified Communication Framework for Enabling Efficient Collaboration Across Heterogeneous Accelerators in Embodied AI Systems [5.241889216655924]
KAITIAN is a novel distributed communication framework for AI workloads.<n>It integrates vendor-optimized communication libraries for intra-group efficiency with general-purpose communication protocols for inter-group interoperability.<n>It can accelerate training time by up to 42% compared to baseline homogeneous systems.
arXiv Detail & Related papers (2025-05-15T11:29:43Z) - Understanding and Optimizing Multi-Stage AI Inference Pipelines [11.254219071373319]
HERMES is a Heterogeneous Multi-stage LLM inference Execution Simulator.<n> HERMES supports heterogeneous clients executing multiple models concurrently unlike prior frameworks.<n>We explore the impact of reasoning stages on end-to-end latency, optimal strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval.
arXiv Detail & Related papers (2025-04-14T00:29:49Z) - A representational framework for learning and encoding structurally enriched trajectories in complex agent environments [1.904851064759821]
The ability of artificial intelligence agents to make optimal decisions and generalise them to different domains and tasks is compromised in complex scenarios.<n>One way to address this issue has focused on learning efficient representations of the world and on how the actions of agents affect them, such as disentangled representations that exploit symmetries.<n>We propose to enrich the agent's ontology and extend the traditionalisation of trajectories to provide a more nuanced view of task execution.
arXiv Detail & Related papers (2025-03-17T14:04:27Z) - AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches to hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z) - Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence [79.5316642687565]
Existing multi-agent frameworks often struggle with integrating diverse capable third-party agents.
We propose the Internet of Agents (IoA), a novel framework that addresses these limitations.
IoA introduces an agent integration protocol, an instant-messaging-like architecture design, and dynamic mechanisms for agent teaming and conversation flow control.
arXiv Detail & Related papers (2024-07-09T17:33:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.