Related papers: Cost-Optimal Active AI Model Evaluation

URL: http://arxiv.org/abs/2506.07949v1
Date: Mon, 09 Jun 2025 17:14:41 GMT
Title: Cost-Optimal Active AI Model Evaluation
Authors: Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, Adam Fisch,
Abstract summary: Development of generative AI systems requires continual evaluation, data acquisition, and annotation. We develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater. We derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters.
Score: 71.2069549142394
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater -- such as a model-based autorater that is designed to automatically assess the quality of generated content -- with a more expensive, but also more accurate, strong rater alternative such as a human. More specifically, the goal of our approach is to produce a low variance, unbiased estimate of the mean of the target "strong" rating, subject to some total annotation budget. Building on recent work in active and prediction-powered statistical inference, we derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Using synthetic and real-world data, we empirically characterize the conditions under which these policies yield improvements over prior methods. We find that, especially in tasks where there is high variability in the difficulty of examples, our policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods.
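Illustrative sketch: the abstract describes estimating the mean "strong" rating without bias while querying the strong rater on only some examples. The Python below is a minimal sketch of a generic inverse-propensity-corrected estimator of the kind used in active and prediction-powered inference. It is not the paper's cost-optimal policy: the function name, the synthetic data, and the uniform query probabilities are all assumptions made for illustration.

```python
import random

def active_mean_estimate(weak, strong_oracle, probs, rng):
    """Estimate the mean strong rating from weak ratings plus sampled corrections.

    weak[i]          : weak (cheap) rating for example i
    strong_oracle(i) : returns the strong (expensive) rating for example i
    probs[i]         : probability in (0, 1] of querying the strong rater on i
    Returns (estimate, number of strong-rater queries).
    """
    total = 0.0
    n_strong = 0
    for i, (f, p) in enumerate(zip(weak, probs)):
        term = f
        if rng.random() < p:
            # Dividing the correction by p keeps the estimate unbiased.
            term += (strong_oracle(i) - f) / p
            n_strong += 1
        total += term
    return total / len(weak), n_strong

# Synthetic data (assumed): the weak rater is a noisy, biased copy of the strong one.
rng = random.Random(0)
strong = [rng.random() for _ in range(1000)]
weak = [min(1.0, max(0.0, y + rng.gauss(0.1, 0.2))) for y in strong]
true_mean = sum(strong) / len(strong)

# Query every example, then query roughly 30% of examples.
est_full, n_full = active_mean_estimate(weak, lambda i: strong[i], [1.0] * 1000, rng)
est_part, n_part = active_mean_estimate(weak, lambda i: strong[i], [0.3] * 1000, rng)
print(true_mean, est_full, n_full, est_part, n_part)
```

With every probability set to 1 the estimate reduces to the plain average of strong ratings. A cost-aware policy would instead choose the probabilities per example, which is the part this sketch leaves out.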

Related papers

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees [36.407171992845456]
We propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data.
arXiv Detail & Related papers (2025-05-24T11:53:29Z)
Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index [5.714609806192087]
We present the Economical Prompting Index (EPI), a novel metric that combines accuracy scores with token consumption. Our study examines 6 advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts.
arXiv Detail & Related papers (2024-12-02T16:34:18Z)
Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models [79.84205827056907]
We present Self-Steering Optimization (SSO), an algorithm that autonomously generates high-quality preference data. SSO employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data. Our evaluation shows that SSO consistently outperforms baselines in human preference alignment and reward optimization.
arXiv Detail & Related papers (2024-10-22T16:04:03Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning. Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
Balanced Off-Policy Evaluation for Personalized Pricing [3.296526804364952]
We consider a personalized pricing problem in which we have data consisting of feature information, historical pricing decisions, and binary realized demand. The goal is to perform off-policy evaluation for a new personalized pricing policy that maps features to prices. Building on the balanced policy evaluation framework of Kallus, we propose a new approach tailored to pricing applications.
arXiv Detail & Related papers (2023-02-24T16:44:46Z)
Personalized Pricing with Invalid Instrumental Variables: Identification, Estimation, and Policy Learning [5.372349090093469]
This work studies offline personalized pricing under endogeneity using an instrumental variable approach. We propose a new policy learning method for Personalized pRicing using Invalid iNsTrumental variables.
arXiv Detail & Related papers (2023-02-24T14:50:47Z)
Latent State Marginalization as a Low-cost Approach for Improving Exploration [79.12247903178934]
We propose the adoption of latent variable policies within the MaxEnt framework. We show that latent variable policies naturally emerge under the use of world models with a latent belief state. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training.
arXiv Detail & Related papers (2022-10-03T15:09:12Z)
Stream-based Active Learning with Verification Latency in Non-stationary Environments [6.883906273999368]
We investigate the influence of finite, time-variable, and unknown verification delay on AL approaches in the presence of concept drift. We propose PRopagate, a latency-independent utility estimator which predicts the requested, but not yet known, labels. We empirically show that the proposed method consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2022-04-14T08:51:15Z)
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
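Illustrative sketch: the thresholding idea behind Average Thresholded Confidence can be shown in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and toy numbers are hypothetical, confidence is taken to be any scalar score per example, and the threshold is picked so that the share of labeled source examples at or above it matches the source accuracy.

```python
def fit_atc_threshold(source_conf, source_correct):
    """Pick a threshold so that the fraction of source examples at or above it
    matches the observed source accuracy (ties may push it slightly higher)."""
    k = sum(1 for c in source_correct if c)  # number of correct source predictions
    if k == 0:
        return float("inf")  # no example should pass the threshold
    # The k-th largest confidence is the cut-off.
    return sorted(source_conf, reverse=True)[k - 1]

def predict_accuracy(target_conf, threshold):
    """Predicted target accuracy: share of unlabeled examples at or above threshold."""
    return sum(1 for c in target_conf if c >= threshold) / len(target_conf)

# Toy values (assumed): 3 of 5 labeled source predictions are correct.
source_conf = [0.9, 0.8, 0.7, 0.6, 0.5]
source_correct = [True, True, True, False, False]
target_conf = [0.95, 0.75, 0.65, 0.4]

threshold = fit_atc_threshold(source_conf, source_correct)
predicted = predict_accuracy(target_conf, threshold)
print(threshold, predicted)
```

Only the labeled source data is used to fit the threshold. The target data contributes confidences alone, which is what lets the estimate be computed without target labels.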

This list is automatically generated from the titles and abstracts of the papers on this site.