Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
- URL: http://arxiv.org/abs/2602.11877v1
- Date: Thu, 12 Feb 2026 12:28:27 GMT
- Title: Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
- Authors: Wanxing Wu, He Zhu, Yixia Li, Lei Yang, Jiehui Zhao, Hongru Wang, Jian Yang, Benyou Wang, Bingyi Jing, Guanhua Chen,
- Abstract summary: Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training.
- Score: 46.00150374727385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
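The abstract describes ProbeDirichlet as pooling cross-layer hidden states with learnable Dirichlet distributions before a routing decision. The sketch below illustrates one plausible reading of that mechanism, not the paper's actual implementation: all function names, shapes, and the sample-at-training / mean-at-inference split are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_dirichlet_route(hidden_states, alpha, probe_w, probe_b, train=False):
    """Hypothetical sketch: pool per-layer hidden states with Dirichlet
    layer weights, then apply a linear probe to get a routing score."""
    # hidden_states: (num_layers, hidden_dim); alpha: (num_layers,) positive
    if train:
        w = rng.dirichlet(alpha)          # sample layer weights (probabilistic training)
    else:
        w = alpha / alpha.sum()           # use the Dirichlet mean at inference
    pooled = w @ hidden_states            # (hidden_dim,) aggregated representation
    logit = pooled @ probe_w + probe_b
    return 1.0 / (1.0 + np.exp(-logit))   # score in (0, 1): route to cloud model?

h = rng.standard_normal((4, 8))           # 4 layers, hidden size 8 (toy values)
score = probe_dirichlet_route(h, alpha=np.ones(4),
                              probe_w=rng.standard_normal(8), probe_b=0.0)
```

In this reading, the concentration vector `alpha` and the probe weights would be trained jointly on routing labels; a query scoring above a threshold is offloaded to the cloud model.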
Related papers
- Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance [92.72779885657373]
We propose a framework that grounds model selection in the internal functional dynamics of the visual encoder. Our approach represents each task via layer-wise conductance and derives a target-conditioned block importance distribution through entropy-regularized alignment. Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target's salient functional blocks.
arXiv Detail & Related papers (2026-02-01T17:29:43Z) - Federate the Router: Learning Language Model Routers with Sparse and Decentralized Evaluations [26.24858921328445]
Large language models (LLMs) are increasingly accessed as remotely hosted services by edge and enterprise clients. Existing router approaches assume access to centralized query-model evaluation data. We introduce the first federated framework for LLM routing, enabling clients to learn a shared routing policy from local offline query-model evaluation data.
arXiv Detail & Related papers (2026-01-29T21:00:29Z) - CASTER: Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient Routing [25.48759875572515]
CASTER (Context-Aware Strategy for Task Efficient Routing) is a lightweight router for dynamic model selection in graph-based MAS. CASTER reduces inference cost by up to 72.4% compared to strong-model baselines.
arXiv Detail & Related papers (2026-01-27T16:52:47Z) - ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models [26.059355108708374]
We propose ECVL-ROUTER, the first scenario-aware routing framework for Vision-Language Models (VLMs). Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements. Results show that our approach successfully routes over 80% of queries to the small model while incurring less than a 10% drop in problem-solving probability.
arXiv Detail & Related papers (2025-10-31T07:46:44Z) - DiSRouter: Distributed Self-Routing for LLM Selections [23.38983740640377]
We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios.
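The traversal described above (answer locally if self-assessed confidence is high, otherwise forward to a peer) can be sketched as a toy protocol. Everything here is an illustrative assumption: the agent names, the fixed confidence threshold, the hop limit, and the peer-selection rule are not taken from the DiSRouter paper.

```python
class Agent:
    """Toy self-routing agent: answers a query itself when its self-assessed
    confidence clears a threshold, otherwise forwards it to a peer agent."""
    def __init__(self, name, confidence_fn, threshold=0.7):
        self.name = name
        self.confidence_fn = confidence_fn   # stand-in for LLM self-awareness
        self.threshold = threshold
        self.peers = []

    def handle(self, query, hops=0, max_hops=3):
        conf = self.confidence_fn(query)
        if conf >= self.threshold or hops >= max_hops or not self.peers:
            return self.name, conf                     # answer locally
        # Simplification: pick the peer reporting the highest confidence.
        best_peer = max(self.peers, key=lambda p: p.confidence_fn(query))
        return best_peer.handle(query, hops + 1)       # route onward

math_agent = Agent("math", lambda q: 0.9 if "integral" in q else 0.2)
code_agent = Agent("code", lambda q: 0.9 if "python" in q else 0.3)
math_agent.peers = [code_agent]
code_agent.peers = [math_agent]

who, conf = math_agent.handle("write python quicksort")
```

Here a coding query entering at the math agent is forwarded once and answered by the code agent; the hop limit prevents queries from cycling indefinitely through low-confidence agents.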
arXiv Detail & Related papers (2025-10-22T03:36:40Z) - Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs [69.2486294522259]
BaRP (Bandit Routing with Preferences) is an approach that trains under the same partial-feedback restriction as deployment. Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt.
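The contextual-bandit framing above (context = prompt features plus a preference vector; reward observed only for the chosen model) can be illustrated with a minimal epsilon-greedy sketch. The linear scorers, epsilon-greedy exploration, and all dimensions are illustrative assumptions, not the BaRP method itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_context(prompt_feats, pref):
    # Context = prompt features concatenated with a user preference
    # vector (e.g. accuracy-vs-cost weights).
    return np.concatenate([prompt_feats, pref])

n_models, dim = 3, 6                                  # 4 prompt dims + 2 preference dims
theta = rng.standard_normal((n_models, dim)) * 0.01   # one linear scorer per model

def route(context, epsilon=0.1):
    """Epsilon-greedy selection over per-model scores."""
    if rng.random() < epsilon:
        return int(rng.integers(n_models))            # explore
    return int(np.argmax(theta @ context))            # exploit

def update(model_idx, context, reward, lr=0.1):
    """Partial (bandit) feedback: only the chosen model's reward is seen."""
    pred = theta[model_idx] @ context
    theta[model_idx] += lr * (reward - pred) * context

ctx = make_context(rng.standard_normal(4), np.array([0.8, 0.2]))
chosen = route(ctx, epsilon=0.0)
update(chosen, ctx, reward=1.0)
```

Because only the routed model's outcome is observed, only that model's scorer is updated per query, mirroring the partial-feedback restriction the blurb describes.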
arXiv Detail & Related papers (2025-10-08T18:24:59Z) - How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance. We propose the DSC benchmark (Diverse, Simple, and Categorized), an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z) - AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning [98.26836657967162]
AgentOhana aggregates agent trajectories from distinct environments, spanning a wide array of scenarios.
xLAM-v0.1, a large action model tailored for AI agents, demonstrates exceptional performance across various benchmarks.
arXiv Detail & Related papers (2024-02-23T18:56:26Z) - Generalized Differentiable RANSAC [95.95627475224231]
$\nabla$-RANSAC is a differentiable RANSAC that allows learning the entire randomized robust estimation pipeline.
$\nabla$-RANSAC is superior to the state-of-the-art in terms of accuracy while running at a similar speed to its less accurate alternatives.
arXiv Detail & Related papers (2022-12-26T15:13:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.