Cost-of-Pass: An Economic Framework for Evaluating Language Models
- URL: http://arxiv.org/abs/2504.13359v1
- Date: Thu, 17 Apr 2025 21:58:29 GMT
- Title: Cost-of-Pass: An Economic Framework for Evaluating Language Models
- Authors: Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou
- Abstract summary: We introduce "cost-of-pass", the expected monetary cost of generating a correct solution. We then define the "frontier cost-of-pass" as the minimum cost-of-pass achievable across available models or the "human-expert". We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks.
- Score: 25.152801302217693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and costs. We propose a framework grounded in production theory for evaluating language models by combining accuracy and inference cost. We introduce "cost-of-pass", the expected monetary cost of generating a correct solution. We then define the "frontier cost-of-pass" as the minimum cost-of-pass achievable across available models or the "human-expert", using the approximate cost of hiring an expert. Our analysis reveals distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking this frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks where the cost has roughly halved every few months. Third, to trace key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency without specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost-reductions afforded by common inference-time techniques like majority voting and self-refinement, finding that their marginal accuracy gains rarely justify their costs. Our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
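The two quantities defined in the abstract can be illustrated with a minimal Python sketch. It assumes that attempts are independent with a fixed success rate, so the expected number of attempts until a correct solution is 1 / success rate and the expected cost is the per-attempt cost divided by the success rate. The abstract does not state this formula, and the function names, candidate names, and all numbers below are hypothetical, chosen only for illustration.

```python
import math


def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    """Expected monetary cost of obtaining one correct solution.

    Assumption: attempts are independent with a fixed success rate, so the
    expected number of attempts is 1 / success_rate.
    """
    if success_rate <= 0.0:
        # A candidate that never succeeds has unbounded expected cost.
        return math.inf
    return cost_per_attempt / success_rate


def frontier_cost_of_pass(candidates: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Return the cheapest candidate and its cost-of-pass.

    `candidates` maps a name to (cost_per_attempt, success_rate). The human
    expert is modeled as one more candidate with its own cost and rate.
    """
    costs = {name: cost_of_pass(c, r) for name, (c, r) in candidates.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


# Hypothetical numbers, for illustration only (not taken from the paper).
candidates = {
    "lightweight-model": (0.002, 0.40),  # cheap per attempt, often wrong
    "reasoning-model": (0.050, 0.80),    # pricier per attempt, more accurate
    "human-expert": (20.0, 1.0),         # assumed always correct
}
best, cost = frontier_cost_of_pass(candidates)
print(best, cost)
```

With these made-up inputs the cheap model sets the frontier even though it fails more often, because its per-attempt cost is low enough to absorb the retries. On a harder task where its success rate falls far enough, the ordering would flip.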
Related papers
- Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models [51.85792055455284]
Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks.
System 1 reasoning is computationally efficient but leads to suboptimal performance.
System 2 reasoning often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors.
arXiv Detail & Related papers (2025-03-31T17:58:07Z)
- Reqo: A Robust and Explainable Query Optimization Cost Model [2.184775414778289]
We propose a tree model architecture based on Bidirectional Graph Neural Networks (Bi-GNN) aggregated by Gated Recurrent Units (GRUs).
We implement a novel learning-to-rank cost model that effectively quantifies the uncertainty in cost estimates using approximate probabilistic ML.
In addition, we propose the first explainability technique specifically designed for learning-based cost models.
arXiv Detail & Related papers (2025-01-29T04:48:51Z)
- Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index [5.714609806192087]
We present the Economical Prompting Index (EPI), a novel metric that combines accuracy scores with token consumption.
Our study examines 6 advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts.
arXiv Detail & Related papers (2024-12-02T16:34:18Z)
- An Economic Framework for 6-DoF Grasp Detection [28.25609101289935]
We propose an economic framework for 6-DoF grasp detection, aiming to economize the resource cost in training and maintain effective grasp performance.
EconomicGrasp surpasses the SOTA grasp method by about 3 AP on average, with extremely low resource cost.
arXiv Detail & Related papers (2024-07-11T10:19:48Z)
- Evolve Cost-aware Acquisition Functions Using Large Language Models [11.209139558885035]
This paper introduces EvolCAF, a novel framework that integrates large language models (LLMs) with evolutionary computation (EC) to automatically design cost-aware AFs.
The designed cost-aware AF maximizes the utilization of available information from historical data, surrogate models and budget details.
In comparison to the well-known EIpu and EI-cool methods designed by human experts, our approach showcases remarkable efficiency and generalization across various tasks.
arXiv Detail & Related papers (2024-04-25T12:19:18Z)
- An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models [55.01592097059969]
Supervised finetuning on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities.
Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool.
We propose using experimental design to circumvent the computational bottlenecks of active learning.
arXiv Detail & Related papers (2024-01-12T16:56:54Z)
- Power Hungry Processing: Watts Driving the Cost of AI Deployment? [74.19749699665216]
Generative, multi-purpose AI systems promise a unified approach to building machine learning (ML) models into technology.
This ambition of "generality" comes at a steep cost to the environment, given the amount of energy these systems require and the amount of carbon that they emit.
We measure deployment cost as the amount of energy and carbon required to perform 1,000 inferences on a representative benchmark dataset using these models.
We conclude with a discussion around the current trend of deploying multi-purpose generative ML systems, and caution that their utility should be more intentionally weighed against increased costs in terms of energy and emissions.
arXiv Detail & Related papers (2023-11-28T15:09:36Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- The Efficiency Misnomer [50.69516433266469]
We discuss common cost indicators, their advantages and disadvantages, and how they can contradict each other.
We demonstrate how incomplete reporting of cost indicators can lead to partial conclusions and a blurred or incomplete picture of the practical considerations of different models.
arXiv Detail & Related papers (2021-10-25T12:48:07Z)
- Costs to Consider in Adopting NLP for Your Business [3.608765813727773]
We show the trade-off between performance gain and the cost across the models to give more insights for AI-pivoting business.
We call for more research into low-cost models, especially for under-resourced languages.
arXiv Detail & Related papers (2020-12-16T13:57:31Z)
- Cost-Sensitive Portfolio Selection via Deep Reinforcement Learning [100.73223416589596]
We propose a cost-sensitive portfolio selection method with deep reinforcement learning.
Specifically, a novel two-stream portfolio policy network is devised to extract both price series patterns and asset correlations.
A new cost-sensitive reward function is developed to maximize the accumulated return and constrain both costs via reinforcement learning.
arXiv Detail & Related papers (2020-03-06T06:28:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.