Related papers: ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

URL: http://arxiv.org/abs/2412.06745v1
Date: Mon, 09 Dec 2024 18:37:14 GMT
Title: ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
Authors: Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge,
Abstract summary: Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models.<n>We propose ONEBench, a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool.<n>By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets.
Score: 30.123976500620834
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. To address this, we propose ONEBench(OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests. The shift from task-specific benchmarks to ONEBench introduces two challenges: (1)heterogeneity and (2)incompleteness. Heterogeneity refers to the aggregation over diverse metrics, while incompleteness describes comparing models evaluated on different data subsets. To address these challenges, we explore algorithms to aggregate sparse measurements into reliable model scores. Our aggregation algorithm ensures identifiability(asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data. On homogenous datasets, we show our aggregation algorithm provides rankings that highly correlate with those produced by average scores. We also demonstrate robustness to ~95% of measurements missing, reducing evaluation cost by up to 20x with little-to-no change in model rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains. Overall, we present a technique for open-ended evaluation, which can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside the rapidly developing foundation models.

Related papers

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following.<n>For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses.<n>Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z)
Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings [23.9553588103042]
We propose a item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves.<n>We show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity.<n>We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation.
arXiv Detail & Related papers (2025-10-30T11:28:58Z)
How Benchmark Prediction from Fewer Data Misses the Mark [18.693874781163657]
Benchmark prediction aims to select a small subset of evaluation points and predict overall benchmark performance from that subset.<n>This paper systematically assesses the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks.
arXiv Detail & Related papers (2025-06-09T11:50:41Z)
Generalization is not a universal guarantee: Estimating similarity to training data with an ensemble out-of-distribution metric [0.09363323206192666]
Failure of machine learning models to generalize to new data is a core problem limiting the reliability of AI systems. We propose a standardized approach for assessing data similarity by constructing a supervised autoencoder for generalizability estimation (SAGE) We show that out-of-the-box model performance increases after SAGE score filtering, even when applied to data from the model's own training and test datasets.
arXiv Detail & Related papers (2025-02-22T19:21:50Z)
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities. We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z)
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy. However, the reliability of test accuracy as the primary performance metric has been called into question. The distribution of hard samples between training and test sets affects the difficulty levels of those sets. We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling [3.7467864495337624]
SubLIME is a data-efficient evaluation framework for text-to-image models. Our approach ensures statistically aligned model rankings compared to full datasets. We leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks.
arXiv Detail & Related papers (2024-06-21T07:38:55Z)
A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups. We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
Efficient Failure Pattern Identification of Predictive Algorithms [15.02620042972929]
We propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm. The results empirically demonstrate the competitive performance of our framework on multiple datasets at various signal-to-noise ratios.
arXiv Detail & Related papers (2023-06-01T14:54:42Z)
Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models. We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench. GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
CEREAL: Few-Sample Clustering Evaluation [4.569028973407756]
We focus on the underexplored problem of estimating clustering quality with limited labels. We introduce CEREAL, a comprehensive framework for few-sample clustering evaluation. Our results show that CEREAL reduces the area under the absolute error curve by up to 57% compared to the best sampling baseline.
arXiv Detail & Related papers (2022-09-30T19:52:41Z)
How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function. Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)
Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.