Related papers: Meta-Router: Bridging Gold-standard and Preference-based Evaluations in Large Language Model Routing

Meta-Router: Bridging Gold-standard and Preference-based Evaluations in Large Language Model Routing

URL: http://arxiv.org/abs/2509.25535v1
Date: Mon, 29 Sep 2025 21:44:00 GMT
Title: Meta-Router: Bridging Gold-standard and Preference-based Evaluations in Large Language Model Routing
Authors: Yichi Zhang, Fangzheng Xie, Shu Yang, Chong Wu,
Abstract summary: A large language model (LLM) router selects the most appropriate model from a pool of candidates for each query.<n> preference-based data, collected via crowdsourcing or LLM-as-a-judge systems, are cheaper and more scalable, yet often biased in reflecting the true quality of responses.<n>We develop an integrative causal router training framework that corrects preference-data bias, address imbalances between two data sources, and improve routing robustness and efficiency.
Score: 15.724480880994259
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In language tasks that require extensive human--model interaction, deploying a single "best" model for every query can be expensive. To reduce inference cost while preserving the quality of the responses, a large language model (LLM) router selects the most appropriate model from a pool of candidates for each query. A central challenge to training a high-quality router is the scarcity of reliable supervision. Gold-standard data (e.g., expert-verified labels or rubric-based scores) provide accurate quality evaluations of LLM responses but are costly and difficult to scale. In contrast, preference-based data, collected via crowdsourcing or LLM-as-a-judge systems, are cheaper and more scalable, yet often biased in reflecting the true quality of responses. We cast the problem of LLM router training with combined gold-standard and preference-based data into a causal inference framework by viewing the response evaluation mechanism as the treatment assignment. This perspective further reveals that the bias in preference-based data corresponds to the well-known causal estimand: the conditional average treatment effect. Based on this new perspective, we develop an integrative causal router training framework that corrects preference-data bias, address imbalances between two data sources, and improve routing robustness and efficiency. Numerical experiments demonstrate that our approach delivers more accurate routing and improves the trade-off between cost and quality.

Related papers

Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems [54.916243942641444]
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications.<n>We study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline.
arXiv Detail & Related papers (2025-12-23T03:10:09Z)
Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs [69.2486294522259]
BaRP is a Bandit Routing-feedback with Preferences approach that trains under the same partial-feedback restriction as deployment.<n> Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt.
arXiv Detail & Related papers (2025-10-08T18:24:59Z)
Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation [55.47971671635531]
Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA)<n>Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge.<n>Existing systems primarily rely on unstructured documents, while largely overlooking relational databases.
arXiv Detail & Related papers (2025-09-30T22:19:44Z)
One Head, Many Models: Cross-Attention Routing for Cost-Aware LLM Selection [3.872690949369412]
Large language models (LLMs) with varying computational costs and performance profiles present a critical challenge for scalable, cost-effective deployment in real-world applications.<n>We introduce a unified routing framework that leverages a single-head cross-attention mechanism to jointly model query and model embeddings.<n>By explicitly capturing fine-grained query-model interactions, our router predicts both response quality and generation cost, achieving up to 6.6% improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum performance over existing routers.
arXiv Detail & Related papers (2025-09-11T18:29:09Z)
Federated In-Context Learning: Iterative Refinement for Improved Answer Quality [62.72381208029899]
In-context learning (ICL) enables language models to generate responses without modifying their parameters by leveraging examples provided in the input.<n>We propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process.<n>Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters.
arXiv Detail & Related papers (2025-06-09T05:33:28Z)
Addressing Data Quality Decompensation in Federated Learning via Dynamic Client Selection [7.603415982653868]
Shapley-Bid Reputation Optimized Federated Learning (SBRO-FL) is a unified framework integrating dynamic bidding, reputation modeling, and cost-aware selection.<n>A reputation system, inspired by prospect theory, captures historical performance while penalizing inconsistency.<n>Experiments on FashionMNIST, EMNIST, CIFAR-10, and SVHN datasets show that SBRO-FL improves accuracy, convergence speed, and robustness, even in adversarial and low-bid interference scenarios.
arXiv Detail & Related papers (2025-05-27T14:06:51Z)
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance.<n>We propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z)
Leveraging Uncertainty Estimation for Efficient LLM Routing [20.67188754368684]
Deploying large language models (LLMs) in edge-cloud environments requires an efficient routing strategy to balance cost and response quality.<n>Traditional approaches prioritize either human-preference data or accuracy metrics from benchmark datasets as routing criteria.<n>We propose the Confidence-Driven LLM Router, a novel framework that leverages uncertainty estimation to optimize routing decisions.
arXiv Detail & Related papers (2025-02-16T07:08:47Z)
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [74.14816777318033]
Collaborative Inference with Token-lEvel Routing (CITER) is a framework that enables efficient collaboration between small and large language models.<n>We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation.<n>Our experiments show that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
arXiv Detail & Related papers (2025-02-04T03:36:44Z)
RouteLLM: Learning to Route LLMs with Preference Data [41.687640419561504]
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost.<n>We propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference.<n>We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance.
arXiv Detail & Related papers (2024-06-26T18:10:22Z)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.