How Benchmark Prediction from Fewer Data Misses the Mark
- URL: http://arxiv.org/abs/2506.07673v1
- Date: Mon, 09 Jun 2025 11:50:41 GMT
- Title: How Benchmark Prediction from Fewer Data Misses the Mark
- Authors: Guanhua Zhang, Florian E. Dorner, Moritz Hardt
- Abstract summary: Benchmark prediction aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. This paper systematically assesses the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks.
- Score: 18.693874781163657
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of benchmark prediction sharply declines when new models have higher accuracy than previously seen models. In this setting of extrapolation, none of the previous methods consistently beat a simple average over random samples. To improve over the sample average, we introduce a new method inspired by augmented inverse propensity weighting. This method consistently outperforms the random sample average even for extrapolation. However, its performance still relies on model similarity and the gains are modest in general. This shows that benchmark prediction fails just when it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
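The abstract's two key estimators can be made concrete with a short sketch. The code below is a minimal illustration on synthetic data, not the authors' implementation: it assumes that per-item correctness scores of previously evaluated ("source") models are available as features, uses a ridge regression as the imputation model, and assumes uniform sampling propensities for the AIPW-style correction. All variable names and modeling choices are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Illustrative setup: binary correctness scores of M previously evaluated
# "source" models on all N benchmark items, plus one new "target" model whose
# full scores are held out and only observed on a random subset of n items.
M, N, n = 40, 2000, 100
source = rng.integers(0, 2, size=(M, N)).astype(float)   # M x N score matrix
target = rng.integers(0, 2, size=N).astype(float)        # full scores (held out)

subset = rng.choice(N, size=n, replace=False)             # random evaluation subset
mask = np.zeros(N, dtype=bool)
mask[subset] = True

# (a) Plain average over the random sample.
sample_avg = target[mask].mean()

# (b) Random-sample-plus-regression baseline: describe each item by the source
# models' scores on it, fit a regression on the sampled items to predict the
# target model's per-item score, and impute the unsampled (missing) entries.
item_features = source.T                                  # N x M
reg = Ridge(alpha=1.0).fit(item_features[mask], target[mask])
imputed = reg.predict(item_features)
regression_estimate = np.where(mask, target, imputed).mean()

# (c) AIPW-style correction (schematic form only): under uniform sampling each
# item has inclusion propensity p = n / N; average the imputed scores over all
# items, then add the propensity-weighted residual on the sampled items.
p = n / N
aipw_estimate = imputed.mean() + ((target - imputed)[mask] / p).sum() / N

print(sample_avg, regression_estimate, aipw_estimate, target.mean())
```

With random synthetic scores the regression learns nothing useful, so the point of the sketch is only the shape of the computation; on a real benchmark the source-model score matrix carries the model-similarity signal the abstract describes.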
Related papers
- RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data. The community has begun establishing best practices for evaluating reward models. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z) - Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z) - Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation [19.673388630963807]
We present TailoredBench, a method that conducts customized evaluation tailored to each target model. A Global-coreset is first constructed as a probe to identify the most consistent source models for each target model. A scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model (a generic k-medoids coreset sketch appears after this list).
arXiv Detail & Related papers (2025-02-19T09:31:50Z) - Do Contemporary Causal Inference Models Capture Real-World Heterogeneity? Findings from a Large-Scale Benchmark [39.06952509635041]
We present unexpected findings from a large-scale benchmark study evaluating Conditional Average Treatment Effect (CATE) estimation algorithms. We find that 62% of CATE estimates have a higher Mean Squared Error (MSE) than a trivial zero-effect predictor, rendering them ineffective. These findings highlight significant challenges in current CATE models and underscore the need for broader evaluation and methodological improvements.
arXiv Detail & Related papers (2024-10-09T16:04:40Z) - Is this model reliable for everyone? Testing for strong calibration [4.893345190925178]
In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup.
The task of auditing a model for strong calibration is well-known to be difficult due to the sheer number of potential subgroups.
Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal.
arXiv Detail & Related papers (2023-07-28T00:59:14Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking [66.83273589348758]
Link prediction attempts to predict whether an unseen edge exists based on only a portion of edges of a graph.
A flurry of methods have been introduced in recent years that attempt to make use of graph neural networks (GNNs) for this task.
New and diverse datasets have also been created to better evaluate the effectiveness of these new models.
arXiv Detail & Related papers (2023-06-18T01:58:59Z) - Post-Selection Confidence Bounds for Prediction Performance [2.28438857884398]
In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks.
We propose an algorithm for computing valid lower confidence bounds for multiple models that have been selected based on their prediction performance on the evaluation set.
arXiv Detail & Related papers (2022-10-24T13:28:43Z) - A Case Study on Sampling Strategies for Evaluating Neural Sequential Item Recommendation Models [69.32128532935403]
Two well-known strategies to sample negative items are uniform random sampling and sampling by popularity.
We re-evaluate current state-of-the-art sequential recommender models from this point of view.
We find that both sampling strategies can produce inconsistent rankings compared with the full ranking of the models.
arXiv Detail & Related papers (2021-07-27T19:06:03Z) - Model-based metrics: Sample-efficient estimates of predictive model subpopulation performance [11.994417027132807]
Machine learning models, now commonly developed to screen, diagnose, or predict health conditions, are evaluated with a variety of performance metrics.
Subpopulation performance metrics are typically computed using only data from that subgroup, resulting in higher variance estimates for smaller groups.
We propose using an evaluation model, a model that describes the conditional distribution of the predictive model score, to form model-based metric (MBM) estimates.
arXiv Detail & Related papers (2021-04-25T19:06:34Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
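As noted in the TailoredBench entry above, several efficient-evaluation methods select a small, representative coreset of benchmark items by clustering. The sketch below is a generic PAM-style k-medoids selection over per-item score columns from a set of probe models; it is an assumption-laden illustration of the general idea, with hypothetical function and variable names, not TailoredBench's Global-coreset or Native-coreset procedure.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_medoids_coreset(probe_scores: np.ndarray, k: int, n_iter: int = 25,
                      seed: int = 0) -> np.ndarray:
    """Return indices of k benchmark items chosen as cluster medoids.

    probe_scores: (num_models, num_items) array of per-item scores from a set
    of probe models; items are compared by the Euclidean distance between
    their score columns. Generic PAM-style sketch, not a specific paper's code.
    """
    rng = np.random.default_rng(seed)
    items = probe_scores.T                      # num_items x num_models
    dist = cdist(items, items)                  # pairwise item distances
    medoids = rng.choice(items.shape[0], size=k, replace=False)

    for _ in range(n_iter):
        # Assign every item to its nearest current medoid.
        assignment = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assignment == c)
            if members.size == 0:
                continue
            # Replace the medoid with the member minimizing total
            # within-cluster distance.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return medoids

# Usage on synthetic probe scores: keep 50 of 500 items as the coreset.
probe = np.random.default_rng(1).random((20, 500))
coreset = k_medoids_coreset(probe, k=50)
```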