Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap
- URL: http://arxiv.org/abs/2512.18126v1
- Date: Fri, 19 Dec 2025 23:06:58 GMT
- Title: Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap
- Authors: Zijun Wang, Yijiahao Qi, Hanqiu Chen, Zishen Wan, Gongjin Sun, Dongyang Li, Shuyi Pei, Cong Hao,
- Abstract summary: Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization.<n>We present a serving design that targets these bottlenecks through an algorithm-system co-design.
- Score: 15.352230356342366
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization, which jointly inflate serving latency. We present a serving design that targets these bottlenecks through an algorithm-system co-design. First, we replace dense agent interaction graphs with a hierarchical tree topology that induces structured sparsity in inter-agent communication. Second, we introduce a runtime adaptive mechanism that selectively terminates or skips downstream agent invocations using semantic agreement and confidence signals from intermediate outputs. Third, we pipeline agent execution by overlapping incremental prefilling with decoding across dependency-related agents, improving utilization and reducing inference latency. Across representative tasks, this approach substantially reduces end-to-end latency (up to 90%) while maintaining comparable accuracy (within $\pm$1%) relative to dense-connectivity MoA baselines, and can improve accuracy in certain settings.
Related papers
- DLLM Agent: See Farther, Run Faster [94.74432470237817]
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties.<n>We study this in a controlled setting by instantiatingDLLM and AR backbones within the same agent workflow.<n>We find thatDLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup.
arXiv Detail & Related papers (2026-02-07T09:01:18Z) - ComAgent: Multi-LLM based Agentic AI Empowered Intelligent Wireless Networks [62.031889234230725]
6G networks rely on complex cross-layer optimization.<n> manually translating high-level intents into mathematical formulations remains a bottleneck.<n>We present ComAgent, a multi-LLM agentic AI framework.
arXiv Detail & Related papers (2026-01-27T13:43:59Z) - Towards Efficient Agents: A Co-Design of Inference Architecture and System [66.59916327634639]
This paper presents AgentInfer, a unified framework for end-to-end agent acceleration.<n>We decompose the problem into four synergistic components: AgentCollab, AgentSched, AgentSAM, and AgentCompress.<n>Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%.
arXiv Detail & Related papers (2025-12-20T12:06:13Z) - InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs [72.5651722107621]
InterAgent is an end-to-end framework for text-driven physics-based multi-agent humanoid control.<n>We introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to cross-modal interference.<n>We also propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies.
arXiv Detail & Related papers (2025-12-08T10:46:01Z) - Distributed Dynamic Associative Memory via Online Convex Optimization [42.94410959330529]
An associative memory (AM) enables cue-response recall, and it has recently been recognized as a key mechanism underlying modern neural architectures such as Transformers.<n>We introduce the concept of distributed dynamic associative memory (DDAM), which extends classical AM to settings with multiple agents and time-varying data streams.<n>In DDAM, each agent maintains a local AM that must not only store its own associations but also selectively memorize information from other agents based on a specified interest matrix.<n>We propose a novel tree-based distributed gradient online descent algorithm, termed DDAM-TOGD, which enables each agent to
arXiv Detail & Related papers (2025-11-28T16:56:18Z) - JigsawComm: Joint Semantic Feature Encoding and Transmission for Communication-Efficient Cooperative Perception [7.867653563872962]
JigsawComm is an end-to-end trained, semantic-aware, and communication-efficient CP framework.<n>It uses a regularized encoder to extract semantically-relevant and sparse features.<n>It uses a lightweight Feature Utility Estimator to predict the contribution of each agent's features to the final perception task.
arXiv Detail & Related papers (2025-11-21T23:36:24Z) - AgentAsk: Multi-Agent Systems Need to Ask [26.13279490836716]
Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving capabilities through collaborative division of labor.<n>We propose AgentAsk, a lightweight and plug-and-play clarification module that treats every inter-agent message as a potential failure point and inserts minimally necessary questions to arrest error propagation.<n>AgentAsk consistently improves accuracy and robustness over public multi-agent implementations while keeping overhead minimal, with latency and extra cost all less than 5%.
arXiv Detail & Related papers (2025-10-08T22:36:05Z) - When Agents Break Down in Multiagent Path Finding [0.0]
We introduce a new variant that formally models scenarios where some agents may experience delays due to malfunctions.<n>We propose a framework for dynamic schedule adaptation that does not rely on full replanning.<n>We prove that following our primary communication protocol, the increase in makespan after k malfunctions is bounded by k additional turns.
arXiv Detail & Related papers (2025-08-05T12:59:30Z) - DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speedup a common Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose textttDASA, a Delay-Adaptive algorithm for multi-agent Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z) - Compressed Regression over Adaptive Networks [58.79251288443156]
We derive the performance achievable by a network of distributed agents that solve, adaptively and in the presence of communication constraints, a regression problem.
We devise an optimized allocation strategy where the parameters necessary for the optimization can be learned online by the agents.
arXiv Detail & Related papers (2023-04-07T13:41:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.