SATER: A Self-Aware and Token-Efficient Approach to Routing and Cascading
- URL: http://arxiv.org/abs/2510.05164v1
- Date: Sat, 04 Oct 2025 19:55:36 GMT
- Title: SATER: A Self-Aware and Token-Efficient Approach to Routing and Cascading
- Authors: Yuanzhe Shen, Yide Liu, Zisu Huang, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang,
- Abstract summary: We introduce SATER, a dual-mode compatible approach that fine-tunes models through shortest-response preference optimization and a confidence-aware rejection mechanism.<n> SATER significantly reduces redundant outputs and response times, while improving both the performance of pre-generation routing and the efficiency of cascade routing.
- Score: 39.20076289493037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) demonstrate remarkable performance across diverse tasks, yet their effectiveness frequently depends on costly commercial APIs or cloud services. Model selection thus entails a critical trade-off between performance and cost: high-performing LLMs typically incur substantial expenses, whereas budget-friendly small language models (SLMs) are constrained by limited capabilities. Current research primarily proposes two routing strategies: pre-generation routing and cascade routing. Both approaches have distinct characteristics, with cascade routing typically offering superior cost-effectiveness and accuracy despite its higher latency. To further address the limitations of both approaches, we introduce SATER, a dual-mode compatible approach that fine-tunes models through shortest-response preference optimization and a confidence-aware rejection mechanism. SATER significantly reduces redundant outputs and response times, while improving both the performance of pre-generation routing and the efficiency of cascade routing. Experiments across three SLMs and six datasets, varying in type and complexity, demonstrate that SATER achieves comparable performance while consistently reducing computational costs by over 50\% and cascade latency by over 80\%.
Related papers
- RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents [91.0187958746262]
RouteMoA is an efficient mixture-of-agents framework with dynamic routing.<n>It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query.<n>It refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference.
arXiv Detail & Related papers (2026-01-26T04:22:22Z) - EvoRoute: Experience-Driven Self-Routing LLM Agent Systems [100.64399490164959]
EvoRoute is a self-evolving model routing paradigm that transcends static, pre-defined model assignments.<n> Experiments on challenging agentic benchmarks demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to $80%$ and latency by over $70%$.
arXiv Detail & Related papers (2026-01-06T04:06:46Z) - One Head, Many Models: Cross-Attention Routing for Cost-Aware LLM Selection [3.872690949369412]
Large language models (LLMs) with varying computational costs and performance profiles present a critical challenge for scalable, cost-effective deployment in real-world applications.<n>We introduce a unified routing framework that leverages a single-head cross-attention mechanism to jointly model query and model embeddings.<n>By explicitly capturing fine-grained query-model interactions, our router predicts both response quality and generation cost, achieving up to 6.6% improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum performance over existing routers.
arXiv Detail & Related papers (2025-09-11T18:29:09Z) - Dynamic Speculative Agent Planning [57.630218933994534]
Large language-model-based agents face critical deployment challenges due to prohibitive latency and inference costs.<n>We introduce Dynamic Speculative Planning (DSP), an online reinforcement learning framework that provides lossless acceleration with substantially reduced costs.<n>Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest acceleration method while reducing total cost by 30% and unnecessary cost up to 60%.
arXiv Detail & Related papers (2025-09-02T03:34:36Z) - LightRouter: Towards Efficient LLM Collaboration with Minimal Overhead [19.573553157421774]
Light is a novel framework designed to systematically select and integrate a small subset of LLMs from a larger pool.<n>Experiments demonstrate that Light matches or outperforms widely-used ensemble baselines, achieving up to a 25% improvement in accuracy.<n>This work introduces a practical approach for efficient LLM selection and provides valuable insights into optimal strategies for model combination.
arXiv Detail & Related papers (2025-05-22T04:46:04Z) - OmniRouter: Budget and Performance Controllable Multi-LLM Routing [31.60019342381251]
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency.<n>We introduce Omni, a controllable routing framework for multi-LLM serving.<n>Experiments show that Omni achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15%.
arXiv Detail & Related papers (2025-02-27T22:35:31Z) - Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices.<n>One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM.<n>We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z) - CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing [74.14816777318033]
Collaborative Inference with Token-lEvel Routing (CITER) is a framework that enables efficient collaboration between small and large language models.<n>We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation.<n>Our experiments show that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications.
arXiv Detail & Related papers (2025-02-04T03:36:44Z) - RouteLLM: Learning to Route LLMs with Preference Data [41.687640419561504]
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost.<n>We propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference.<n>We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance.
arXiv Detail & Related papers (2024-06-26T18:10:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.