Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
- URL: http://arxiv.org/abs/2602.11877v1
- Date: Thu, 12 Feb 2026 12:28:27 GMT
- Title: Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
- Authors: Wanxing Wu, He Zhu, Yixia Li, Lei Yang, Jiehui Zhao, Hongru Wang, Jian Yang, Benyou Wang, Bingyi Jing, Guanhua Chen,
- Abstract summary: Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training.
- Score: 46.00150374727385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
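The abstract describes ProbeDirichlet as pooling cross-layer hidden states with learnable Dirichlet distributions before a routing decision. The sketch below illustrates one plausible reading of that mechanism, not the paper's actual implementation: all function names, shapes, and the sample-at-training / mean-at-inference split are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_dirichlet_route(hidden_states, alpha, probe_w, probe_b, train=False):
    """Hypothetical sketch: pool per-layer hidden states with Dirichlet
    layer weights, then apply a linear probe to get a routing score."""
    # hidden_states: (num_layers, hidden_dim); alpha: (num_layers,) positive
    if train:
        w = rng.dirichlet(alpha)          # sample layer weights (probabilistic training)
    else:
        w = alpha / alpha.sum()           # use the Dirichlet mean at inference
    pooled = w @ hidden_states            # (hidden_dim,) aggregated representation
    logit = pooled @ probe_w + probe_b
    return 1.0 / (1.0 + np.exp(-logit))   # score in (0, 1): route to cloud model?

h = rng.standard_normal((4, 8))           # 4 layers, hidden size 8 (toy values)
score = probe_dirichlet_route(h, alpha=np.ones(4),
                              probe_w=rng.standard_normal(8), probe_b=0.0)
```

In this reading, the concentration vector `alpha` and the probe weights would be trained jointly on routing labels; a query scoring above a threshold is offloaded to the cloud model.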
Related papers
- Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance [92.72779885657373]
We propose a framework that grounds model selection in the internal functional dynamics of the visual encoder. Our approach represents each task via layer-wise conductance and derives a target-conditioned block importance distribution through entropy-regularized alignment. Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target's salient functional blocks.
arXiv Detail & Related papers (2026-02-01T17:29:43Z) - Federate the Router: Learning Language Model Routers with Sparse and Decentralized Evaluations [26.24858921328445]
Large language models (LLMs) are increasingly accessed as remotely hosted services by edge and enterprise clients. Existing router approaches assume access to centralized query-model evaluation data. We introduce the first federated framework for LLM routing, enabling clients to learn a shared routing policy from local offline query-model evaluation data.
arXiv Detail & Related papers (2026-01-29T21:00:29Z) - CASTER: Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient Routing [25.48759875572515]
CASTER (Context-Aware Strategy for Task Efficient Routing) is a lightweight router for dynamic model selection in graph-based MAS. CASTER reduces inference cost by up to 72.4% compared to strong-model baselines.
arXiv Detail & Related papers (2026-01-27T16:52:47Z) - ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models [26.059355108708374]
We propose ECVL-ROUTER, the first scenario-aware routing framework for Vision-Language Models (VLMs). Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements. Results show that our approach successfully routes over 80% of queries to the small model while incurring less than a 10% drop in problem-solving probability.
arXiv Detail & Related papers (2025-10-31T07:46:44Z) - DiSRouter: Distributed Self-Routing for LLM Selections [23.38983740640377]
We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios.
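The traversal described above (answer locally if self-assessed confidence is high, otherwise forward to a peer) can be sketched as a toy protocol. Everything here is an illustrative assumption: the agent names, the fixed confidence threshold, the hop limit, and the peer-selection rule are not taken from the DiSRouter paper.

```python
class Agent:
    """Toy self-routing agent: answers a query itself when its self-assessed
    confidence clears a threshold, otherwise forwards it to a peer agent."""
    def __init__(self, name, confidence_fn, threshold=0.7):
        self.name = name
        self.confidence_fn = confidence_fn   # stand-in for LLM self-awareness
        self.threshold = threshold
        self.peers = []

    def handle(self, query, hops=0, max_hops=3):
        conf = self.confidence_fn(query)
        if conf >= self.threshold or hops >= max_hops or not self.peers:
            return self.name, conf                     # answer locally
        # Simplification: pick the peer reporting the highest confidence.
        best_peer = max(self.peers, key=lambda p: p.confidence_fn(query))
        return best_peer.handle(query, hops + 1)       # route onward

math_agent = Agent("math", lambda q: 0.9 if "integral" in q else 0.2)
code_agent = Agent("code", lambda q: 0.9 if "python" in q else 0.3)
math_agent.peers = [code_agent]
code_agent.peers = [math_agent]

who, conf = math_agent.handle("write python quicksort")
```

Here a coding query entering at the math agent is forwarded once and answered by the code agent; the hop limit prevents queries from cycling indefinitely through low-confidence agents.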
arXiv Detail & Related papers (2025-10-22T03:36:40Z) - Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs [69.2486294522259]
BaRP (Bandit Routing with Preferences) is an approach that trains under the same partial-feedback restriction as deployment. Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt.
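The contextual-bandit framing above (context = prompt features plus a preference vector; reward observed only for the chosen model) can be illustrated with a minimal epsilon-greedy sketch. The linear scorers, epsilon-greedy exploration, and all dimensions are illustrative assumptions, not the BaRP method itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_context(prompt_feats, pref):
    # Context = prompt features concatenated with a user preference
    # vector (e.g. accuracy-vs-cost weights).
    return np.concatenate([prompt_feats, pref])

n_models, dim = 3, 6                                  # 4 prompt dims + 2 preference dims
theta = rng.standard_normal((n_models, dim)) * 0.01   # one linear scorer per model

def route(context, epsilon=0.1):
    """Epsilon-greedy selection over per-model scores."""
    if rng.random() < epsilon:
        return int(rng.integers(n_models))            # explore
    return int(np.argmax(theta @ context))            # exploit

def update(model_idx, context, reward, lr=0.1):
    """Partial (bandit) feedback: only the chosen model's reward is seen."""
    pred = theta[model_idx] @ context
    theta[model_idx] += lr * (reward - pred) * context

ctx = make_context(rng.standard_normal(4), np.array([0.8, 0.2]))
chosen = route(ctx, epsilon=0.0)
update(chosen, ctx, reward=1.0)
```

Because only the routed model's outcome is observed, only that model's scorer is updated per query, mirroring the partial-feedback restriction the blurb describes.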
arXiv Detail & Related papers (2025-10-08T18:24:59Z) - How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance. We propose the DSC benchmark (Diverse, Simple, and Categorized), an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z) - AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning [98.26836657967162]
AgentOhana aggregates agent trajectories from distinct environments, spanning a wide array of scenarios.
xLAM-v0.1, a large action model tailored for AI agents, demonstrates exceptional performance across various benchmarks.
arXiv Detail & Related papers (2024-02-23T18:56:26Z) - Generalized Differentiable RANSAC [95.95627475224231]
$\nabla$-RANSAC is a differentiable RANSAC that allows learning the entire randomized robust estimation pipeline.
$\nabla$-RANSAC is superior to the state-of-the-art in terms of accuracy while running at a similar speed to its less accurate alternatives.
arXiv Detail & Related papers (2022-12-26T15:13:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.