Cost-Optimal Active AI Model Evaluation
- URL: http://arxiv.org/abs/2506.07949v1
- Date: Mon, 09 Jun 2025 17:14:41 GMT
- Title: Cost-Optimal Active AI Model Evaluation
- Authors: Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, Adam Fisch
- Abstract summary: Development of generative AI systems requires continual evaluation, data acquisition, and annotation. We develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater with a more expensive, but more accurate, strong rater. We derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters.
- Score: 71.2069549142394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater -- such as a model-based autorater that is designed to automatically assess the quality of generated content -- with a more expensive, but also more accurate, strong rater alternative such as a human. More specifically, the goal of our approach is to produce a low variance, unbiased estimate of the mean of the target "strong" rating, subject to some total annotation budget. Building on recent work in active and prediction-powered statistical inference, we derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Using synthetic and real-world data, we empirically characterize the conditions under which these policies yield improvements over prior methods. We find that, especially in tasks where there is high variability in the difficulty of examples, our policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods.
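The kind of estimator the abstract describes (an unbiased estimate of the mean strong rating that combines weak ratings on every example with strong ratings on a sampled subset) can be illustrated with a minimal sketch. This is the generic inverse-probability-weighted form from active, prediction-powered inference, not the paper's cost-optimal policy: the uniform sampling probability, the synthetic ratings, and the names `active_ppi_mean` and `simulate` are all assumptions made for illustration.

```python
import random
from statistics import fmean


def active_ppi_mean(weak, strong, probs):
    """Estimate the mean strong rating from weak ratings plus a sampled correction.

    weak[i]   : weak rating, available for every example
    strong[i] : strong rating, or None if it was not collected
    probs[i]  : probability (> 0) with which the strong rating was requested
    """
    terms = []
    for w, s, p in zip(weak, strong, probs):
        # Dividing the residual by p keeps the estimate unbiased even though
        # the strong rating is only observed on part of the data.
        correction = 0.0 if s is None else (s - w) / p
        terms.append(w + correction)
    return fmean(terms)


def simulate(n=20000, bias=0.1, p=0.2, seed=0):
    """Hypothetical data: weak ratings are the strong ratings plus bias and noise."""
    rng = random.Random(seed)
    truth = [rng.gauss(0.6, 0.2) for _ in range(n)]
    weak = [t + bias + rng.gauss(0.0, 0.1) for t in truth]
    probs = [p] * n  # assumption: uniform sampling, not a learned policy
    strong = [t if rng.random() < p else None for t in truth]
    return truth, weak, strong, probs


truth, weak, strong, probs = simulate()
estimate = active_ppi_mean(weak, strong, probs)
print(f"mean strong rating : {fmean(truth):.4f}")
print(f"weak ratings only  : {fmean(weak):.4f}")
print(f"corrected estimate : {estimate:.4f}")
```

In this sketch the weak-only average inherits the weak rater's bias, while the corrected estimate does not. Replacing the uniform `probs` with example-dependent probabilities (for instance, higher where the weak rater is expected to be less reliable) keeps the estimate unbiased and only changes its variance, which is the quantity an allocation policy would aim to reduce under a budget.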
Related papers
- Fault-Tolerant Evaluation for Sample-Efficient Model Performance Estimators [13.227055178509524]
We propose a fault-tolerant evaluation framework that integrates bias and variance considerations within an adjustable tolerance level. We show that proper calibration of $\varepsilon$ ensures reliable evaluation across different variance regimes. Experiments on real-world datasets demonstrate that our framework provides comprehensive and actionable insights into estimator behavior.
arXiv Detail & Related papers (2026-02-06T22:14:46Z) - $V_0$: A Generalist Value Model for Any Policy at State Zero [80.7505802128501]
Policy methods rely on a baseline to measure the relative advantage of an action. This baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. We propose a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts.
arXiv Detail & Related papers (2026-02-03T14:35:23Z) - Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - Labels or Preferences? Budget-Constrained Learning with Human Judgments over AI-Generated Outputs [17.028710603629026]
We show how to optimally allocate a fixed annotation budget between ground-truth labels and pairwise preferences in AI. We introduce Preference-Calibrated Active Learning (PCAL), a novel robustness method that learns an optimal data acquisition strategy. This work provides a principled and statistically efficient approach for budget-constrained learning in modern AI.
arXiv Detail & Related papers (2026-01-19T23:23:29Z) - Consecutive Preferential Bayesian Optimization [5.048216954459151]
We generalize preference-based optimization to account for production and evaluation costs. We empirically demonstrate a notable increase in accuracy in setups with high production costs or with indifference feedback.
arXiv Detail & Related papers (2025-11-07T11:30:36Z) - Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees [36.407171992845456]
We propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data.
arXiv Detail & Related papers (2025-05-24T11:53:29Z) - Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index [5.714609806192087]
We present the Economical Prompting Index (EPI), a novel metric that combines accuracy scores with token consumption. Our study examines 6 advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts.
arXiv Detail & Related papers (2024-12-02T16:34:18Z) - Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models [79.84205827056907]
We present Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference data. $SSO$ employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data. Our evaluation shows that $SSO$ consistently outperforms baselines in human preference alignment and reward optimization.
arXiv Detail & Related papers (2024-10-22T16:04:03Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Balanced Off-Policy Evaluation for Personalized Pricing [3.296526804364952]
We consider a personalized pricing problem in which we have data consisting of feature information, historical pricing decisions, and binary realized demand.
The goal is to perform off-policy evaluation for a new personalized pricing policy that maps features to prices.
Building on the balanced policy evaluation framework of Kallus, we propose a new approach tailored to pricing applications.
arXiv Detail & Related papers (2023-02-24T16:44:46Z) - Personalized Pricing with Invalid Instrumental Variables: Identification, Estimation, and Policy Learning [5.372349090093469]
This work studies offline personalized pricing under endogeneity using an instrumental variable approach.
We propose a new policy learning method for Personalized pRicing using Invalid iNsTrumental variables.
arXiv Detail & Related papers (2023-02-24T14:50:47Z) - Latent State Marginalization as a Low-cost Approach for Improving Exploration [79.12247903178934]
We propose the adoption of latent variable policies within the MaxEnt framework.
We show that latent variable policies naturally emerge under the use of world models with a latent belief state.
We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training.
arXiv Detail & Related papers (2022-10-03T15:09:12Z) - Stream-based Active Learning with Verification Latency in Non-stationary Environments [6.883906273999368]
We investigate the influence of finite, time-variable, and unknown verification delay on AL approaches in the presence of concept drift.
We propose PRopagate, a latency independent utility estimator which predicts the requested, but not yet known, labels.
We empirically show that the proposed method consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2022-04-14T08:51:15Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which the model's confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
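The thresholded-confidence idea in the last entry above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the threshold is chosen so that the fraction of labeled source examples above it matches source accuracy, the confidence score is taken to be a single number per example, and the function names and toy values are hypothetical.

```python
def fit_threshold(source_conf, source_correct):
    """Choose a threshold whose exceedance rate on source data matches source accuracy."""
    accuracy = sum(source_correct) / len(source_correct)
    # Candidate thresholds: every observed confidence, plus -inf (everything exceeds it).
    candidates = [float("-inf")] + sorted(set(source_conf))

    def gap(t):
        frac_above = sum(c > t for c in source_conf) / len(source_conf)
        return abs(frac_above - accuracy)

    return min(candidates, key=gap)


def predict_accuracy(target_conf, threshold):
    """Predicted accuracy: fraction of unlabeled target examples above the threshold."""
    return sum(c > threshold for c in target_conf) / len(target_conf)


# Toy labeled source data (hypothetical values): 3 of 5 predictions are correct.
source_conf = [0.95, 0.9, 0.8, 0.6, 0.55]
source_correct = [1, 1, 1, 0, 0]
threshold = fit_threshold(source_conf, source_correct)

# Unlabeled target data: only confidences are available.
target_conf = [0.9, 0.7, 0.5, 0.4]
print(threshold, predict_accuracy(target_conf, threshold))
```

The exhaustive search over candidates is quadratic in the number of source examples, which is fine for a sketch; sorting once and indexing by rank would be the natural choice for larger data.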
This list is automatically generated from the titles and abstracts of the papers on this site.