Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
- URL: http://arxiv.org/abs/2508.12631v2
- Date: Wed, 22 Oct 2025 02:23:31 GMT
- Title: Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
- Authors: Yiqun Zhang, Hao Li, Jianhao Chen, Hangfan Zhang, Peng Ye, Lei Bai, Shuyue Hu,
- Abstract summary: Avengers-Pro is a test-time routing framework for large language models.<n>It ensembles LLMs of varying capacities and efficiencies.<n>It can surpass the strongest single model by +7% in average accuracy.
- Score: 33.70788437189248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models -- including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 -- Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Last but not least, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
Related papers
- RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents [91.0187958746262]
RouteMoA is an efficient mixture-of-agents framework with dynamic routing.<n>It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query.<n>It refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference.
arXiv Detail & Related papers (2026-01-26T04:22:22Z) - LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost [1.7133809948345597]
We present a production-ready framework for evaluating acceptance tests with structured evaluations.<n>We provide the first comprehensive analysis spanning accuracy, operational reliability, and cost.<n>We release the dataset, framework, and code to support and deployment.
arXiv Detail & Related papers (2025-12-01T03:19:33Z) - Performance of GPT-5 Frontier Models in Ophthalmology Question Answering [6.225411871775591]
Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on medical question-answering tasks.<n>We evaluated 12 configurations of OpenAI's GPT-5 series alongside o1-high, o3-high, and GPT-4o.<n> GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high)<n>These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of
arXiv Detail & Related papers (2025-08-13T17:17:17Z) - Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services [9.687789919349523]
We propose Fremer, an efficient and effective deep forecasting model.<n>Fremer fulfills three critical requirements: it demonstrates superior efficiency, outperforming most Transformer-based forecasting models.<n>It achieves exceptional accuracy, surpassing all state-of-the-art (SOTA) models in workload forecasting.
arXiv Detail & Related papers (2025-07-17T08:51:28Z) - EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z) - Controllable Prompt Tuning For Balancing Group Distributional Robustness [53.336515056479705]
We introduce an optimization scheme to achieve good performance across groups and find a good solution for all without severely sacrificing performance on any of them.
We propose Controllable Prompt Tuning (CPT), which couples our approach with prompt-tuning techniques.
On spurious correlation benchmarks, our procedures achieve state-of-the-art results across both transformer and non-transformer architectures, as well as unimodal and multimodal data.
arXiv Detail & Related papers (2024-03-05T06:23:55Z) - Astraios: Parameter-Efficient Instruction Tuning Code Large Language
Models [21.17021844323919]
We introduce Astraios, a suite of 28 instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up to 16 billion parameters.
We find that FFT leads to the best downstream performance across all scales, and PEFT methods differ significantly in their efficacy based on the model scale.
arXiv Detail & Related papers (2024-01-01T15:30:19Z) - Federated Learning of Large Language Models with Parameter-Efficient
Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs the update of significant parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z) - LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning [56.88751562302793]
Low-rank adaption (LoRA) has emerged to fine-tune large language models (LLMs)
LoRAPrune is a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner.
LoRAPrune achieves a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%.
arXiv Detail & Related papers (2023-05-28T15:15:48Z) - Ensemble of Averages: Improving Model Selection and Boosting Performance
in Domain Generalization [63.28279815753543]
In Domain Generalization (DG) settings, models trained on a given set of training domains have notoriously chaotic performance on shifted test domains.
We first show that a simple protocol for averaging model parameters along the optimization path, starting early during training, significantly boosts domain generalizationity.
We show that an ensemble of independently trained models also has a chaotic behavior in the DG setting.
arXiv Detail & Related papers (2021-10-21T00:08:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.