RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents
- URL: http://arxiv.org/abs/2601.18130v1
- Date: Mon, 26 Jan 2026 04:22:22 GMT
- Title: RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents
- Authors: Jize Wang, Han Wu, Zhiyuan You, Yiming Song, Yijun Wang, Zifei Shan, Yining Li, Songyang Zhang, Xinyi Le, Cailian Chen, Xinping Guan, Dacheng Tao
- Abstract summary: RouteMoA is an efficient mixture-of-agents framework with dynamic routing. It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query. A mixture of judges then refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference.
- Score: 91.0187958746262
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mixture-of-Agents (MoA) improves LLM performance through layered collaboration, but its dense topology raises costs and latency. Existing methods employ LLM judges to filter responses, yet still require all models to perform inference before judging, failing to cut costs effectively. They also lack model selection criteria and struggle with large model pools, where full inference is costly and can exceed context limits. To address this, we propose RouteMoA, an efficient mixture-of-agents framework with dynamic routing. It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query, narrowing candidates to a high-potential subset without inference. A mixture of judges then refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference. Finally, a model ranking mechanism selects models by balancing performance, cost, and latency. RouteMoA outperforms MoA across varying tasks and model pool sizes, reducing cost by 89.8% and latency by 63.6% in the large-scale model pool.
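The abstract describes a three-stage pipeline: query-only screening, judge-based posterior correction over existing outputs, and ranking under a performance/cost/latency trade-off. The sketch below is a minimal, hypothetical illustration of that flow, not the authors' implementation: the scorer heuristic, the judge's agreement signal, the `call_model` stub, and the utility weights are all assumptions made purely to show how the three stages can fit together.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost: float     # assumed per-query cost, arbitrary units
    latency: float  # assumed mean latency in seconds

def lightweight_score(query: str, model: Candidate) -> float:
    """Stage 1: coarse performance prediction from the query alone (no inference).
    Placeholder heuristic; the paper trains a lightweight scorer for this."""
    return 1.0 / (1.0 + model.cost) + 0.001 * len(query)

def call_model(model: Candidate, query: str) -> str:
    """Placeholder for querying a shortlisted model once."""
    return f"[{model.name}] answer to: {query}"

def judge_refine(outputs: dict, prior: dict) -> dict:
    """Stage 2: refine prior scores using the outputs that already exist
    (posterior correction without extra candidate-model inference).
    Placeholder: nudge each prior by a trivial relative-length signal."""
    mean_len = sum(len(o) for o in outputs.values()) / len(outputs)
    return {name: prior[name] + 0.01 * (len(out) - mean_len)
            for name, out in outputs.items()}

def route(query: str, pool: list[Candidate], k: int = 3,
          w_perf: float = 1.0, w_cost: float = 0.1, w_lat: float = 0.05) -> Candidate:
    # Stage 1: screen the full pool without running any candidate model.
    prior = {m.name: lightweight_score(query, m) for m in pool}
    shortlist = sorted(pool, key=lambda m: prior[m.name], reverse=True)[:k]

    # Run only the shortlisted models; the judges reuse these outputs.
    outputs = {m.name: call_model(m, query) for m in shortlist}

    # Stage 2: posterior correction based on outputs that already exist.
    posterior = judge_refine(outputs, prior)

    # Stage 3: rank by a weighted performance/cost/latency trade-off
    # (one plausible criterion; the paper's exact ranking rule may differ).
    return max(shortlist, key=lambda m: w_perf * posterior[m.name]
               - w_cost * m.cost - w_lat * m.latency)

pool = [Candidate("small-llm", 0.1, 0.5),
        Candidate("mid-llm", 0.5, 1.2),
        Candidate("big-llm", 2.0, 3.0)]
print(route("Summarize the trade-offs of dense MoA topologies.", pool, k=2).name)
```

In this toy setup the expensive model never runs unless it survives the query-only screening, which is the source of the cost and latency savings the abstract reports.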
Related papers
- Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference [0.0]
"Pyramid MoA" is a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.<n>We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
arXiv Detail & Related papers (2026-02-23T04:47:47Z)
- Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning [28.165465162107253]
We propose SCOPE, a routing framework that goes beyond model selection by also predicting each candidate model's cost and performance. SCOPE makes reasoning-based predictions by retrieving how models behave on similar problems, rather than relying on fixed model names. It can boost accuracy by up to 25.7% when performance is the priority, or cut costs by up to 95.1% when efficiency matters most.
arXiv Detail & Related papers (2026-01-29T21:09:36Z)
- MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging [72.00014675808228]
MergeMix determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. Experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning.
arXiv Detail & Related papers (2026-01-25T14:31:57Z)
- LLM Routing with Dueling Feedback [49.67815163970033]
We study the problem of selecting the best model for each query while balancing user satisfaction, model expertise, and inference cost. We formulate routing as contextual dueling bandits, learning from pairwise preference feedback rather than absolute scores. We introduce Category-Calibrated Fine-Tuning (CCFT), a representation-learning method that derives model embeddings from offline data using contrastive fine-tuning with categorical weighting.
arXiv Detail & Related papers (2025-10-01T12:52:25Z)
- BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute [25.740809143951815]
BEST-Route is a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.
arXiv Detail & Related papers (2025-06-28T01:52:50Z)
- Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing [9.217991144854851]
Mixture-of-Experts (MoE) models are now a dominant type of model architecture. We study optimized MoE model deployment and distributed inference serving on a serverless platform. Our designs reduce the billed cost of all MoE layers by at least 75.67% compared to CPU clusters.
arXiv Detail & Related papers (2025-01-09T15:29:33Z)
- Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation. (A minimal illustrative sketch of this idea appears after this list.)
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
- MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$-$10\times$ faster and tune hyperparameters $20\times$-$75\times$ faster than full-dataset training or tuning without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z)
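As a side note on the "Ranked from Within" entry above, the following toy sketch shows one way a label-free ranking signal can be computed: each candidate model is scored by its mean maximum softmax probability on unlabeled examples and models are sorted by that score. The `softmax` helper, the confidence rule, and the demo data are simplified assumptions for illustration, not that paper's actual procedure.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_score(logits: np.ndarray) -> float:
    """Mean max-softmax probability over unlabeled examples; higher = more confident."""
    return float(softmax(logits).max(axis=-1).mean())

def rank_models(logits_by_model: dict[str, np.ndarray]) -> list[str]:
    """Rank candidate models on unlabeled data, most confident first."""
    scores = {name: confidence_score(logits) for name, logits in logits_by_model.items()}
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(0)
demo = {"model-a": rng.normal(0, 3, (100, 10)),   # sharper logits -> higher confidence
        "model-b": rng.normal(0, 1, (100, 10))}
print(rank_models(demo))
```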