Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation
- URL: http://arxiv.org/abs/2508.03663v1
- Date: Tue, 05 Aug 2025 17:18:34 GMT
- Title: Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation
- Authors: Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan
- Abstract summary: We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We find that accounting for human disagreement may come with $N \times K$ at no more than 1000 for every dataset tested on at least one metric.
- Score: 5.506095201822833
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically the case. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining more samples from multiple annotators for each example greatly increases the per-item annotation costs. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, and simulated distributions fit to these datasets, to determine the optimal $(N, K)$ configuration, given a fixed budget ($N \times K$), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that accounting for human disagreement may come with $N \times K$ at no more than 1000 (and often much lower) for every dataset tested on at least one metric. Moreover, this minimal $N \times K$ almost always occurred for $K > 10$. Furthermore, the nature of the tradeoff between $K$ and $N$ -- or if one even existed -- depends on the evaluation metric, with metrics that are more sensitive to the full distribution of responses performing better at higher levels of $K$. Our methods can be used to help ML practitioners get more effective test data by finding the optimal metrics and number of items and annotations per item to collect to get the most reliability for their budget.
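Below is a minimal simulation sketch of the kind of budget-constrained comparison the abstract describes: fix a budget $N \times K$, sweep $K$, and measure how reliably a distribution-sensitive metric ranks two simulated models. All distributions, metrics, and constants here are illustrative assumptions, not the paper's datasets or protocol.

```python
# Illustrative sketch (not the paper's code): for a fixed annotation budget
# B = N * K, sweep K and estimate how reliably an evaluation metric ranks two
# simulated models. Item response distributions and "models" are synthetic.
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 3
BUDGET = 1000          # fixed N * K
POOL_SIZE = 5000       # pool of items to draw N from
REPEATS = 200          # resampling repeats per (N, K) setting

# Synthetic per-item response distributions (stand-in for crowd disagreement).
item_dists = rng.dirichlet(alpha=np.ones(NUM_CLASSES) * 0.8, size=POOL_SIZE)

def model_predictions(noise):
    """A 'model' = the human distributions blurred by some noise level."""
    noisy = item_dists + noise * rng.dirichlet(np.ones(NUM_CLASSES), size=POOL_SIZE)
    return noisy / noisy.sum(axis=1, keepdims=True)

model_a = model_predictions(noise=0.1)   # closer to the human distribution
model_b = model_predictions(noise=0.4)   # farther from it

def soft_metric(pred, label_counts):
    """Distribution-sensitive metric: mean cross-entropy against empirical labels."""
    emp = label_counts / label_counts.sum(axis=1, keepdims=True)
    return -np.mean(np.sum(emp * np.log(pred + 1e-12), axis=1))

def rank_stability(k):
    n = BUDGET // k
    wins = 0
    for _ in range(REPEATS):
        items = rng.choice(POOL_SIZE, size=n, replace=False)
        # K simulated annotations per sampled item.
        counts = np.stack([rng.multinomial(k, item_dists[i]) for i in items])
        # Lower cross-entropy is better; count how often A (truly better) wins.
        wins += soft_metric(model_a[items], counts) < soft_metric(model_b[items], counts)
    return wins / REPEATS

for k in (1, 5, 10, 25, 50):
    print(f"K={k:3d}  N={BUDGET // k:4d}  P(correct ranking)={rank_stability(k):.2f}")
```

The printout is only meant to show the shape of the trade-off under these toy assumptions; the paper's conclusions come from real annotated datasets and several metrics, not this setup.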
Related papers
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets. Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
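As a rough illustration of the idea summarized above (not the DUPRE implementation), the sketch below fits a Gaussian process regressor on the utilities of a few evaluated subsets and predicts, with uncertainty, the utility of unseen subsets; representing a subset as a binary membership vector is an assumption made here for simplicity.

```python
# Illustrative sketch only: GP regression from evaluated subset utilities to
# predicted utilities of unseen subsets. Featurization and utilities are toy.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
num_points = 12                      # data points in the pool
num_evaluated_subsets = 40           # subsets whose utility we paid to evaluate
weights = rng.random(num_points)     # hidden "value" of each point (synthetic)

def random_subsets(n):
    # Each subset encoded as a 0/1 membership vector (an assumption here).
    return rng.integers(0, 2, size=(n, num_points)).astype(float)

def true_utility(subsets):
    # Stand-in for retraining the model on a subset and measuring its utility.
    return subsets @ weights + 0.05 * rng.standard_normal(len(subsets))

X_train = random_subsets(num_evaluated_subsets)
y_train = true_utility(X_train)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)

# Predict (with uncertainty) the utility of subsets we never retrained on.
mean, std = gp.predict(random_subsets(5), return_std=True)
for m, s in zip(mean, std):
    print(f"predicted utility {m:.3f} ± {s:.3f}")
```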
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - How to Select Datapoints for Efficient Human Evaluation of NLG Models? [57.60407340254572]
We develop and analyze a suite of selectors to get the most informative datapoints for human evaluation. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
arXiv Detail & Related papers (2025-01-30T10:33:26Z) - Is $F_1$ Score Suboptimal for Cybersecurity Models? Introducing $C_{score}$, a Cost-Aware Alternative for Model Assessment [1.747623282473278]
False positives and false negatives are not equal and are application dependent.
In cybersecurity applications, the cost of not detecting an attack is very different from marking a benign activity as an attack.
We propose a new cost-aware metric, $C_{score}$, based on precision and recall.
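The summary above does not give the $C_{score}$ formula, so the snippet below is not that metric; it only contrasts plain $F_1$ with a generic cost-weighted error to illustrate why weighting false positives and false negatives equally can mislead in security settings.

```python
# Illustrative only: contrast F1 (equal FP/FN weighting) with a generic
# cost-weighted error. This is NOT the paper's C_score definition.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cost_weighted_error(tp, fp, fn, cost_fp=1.0, cost_fn=50.0):
    # A simple average cost per prediction outcome, for intuition only.
    return (cost_fp * fp + cost_fn * fn) / (tp + fp + fn)

# Model A misses more attacks (FN); Model B raises more false alarms (FP).
model_a = dict(tp=90, fp=5, fn=10)
model_b = dict(tp=98, fp=30, fn=2)

for name, m in (("A", model_a), ("B", model_b)):
    print(name, f"F1={f1_score(**m):.3f}", f"cost={cost_weighted_error(**m):.1f}")
# F1 prefers A, while the cost-weighted view prefers B, because missed attacks
# are assumed (here) to be 50x more expensive than false alarms.
```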
arXiv Detail & Related papers (2024-07-19T21:01:19Z) - Scalable Learning of Item Response Theory Models [48.91265296134559]
Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data.
We leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets.
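As a hedged illustration of the IRT/logistic-regression connection the summary alludes to (the coreset construction itself is not shown), the sketch below fits a Rasch-style 1PL model as an intercept-free logistic regression on examinee and item indicator features.

```python
# Minimal sketch (not the paper's method): a Rasch (1PL) model
# P(correct) = sigmoid(ability_i - difficulty_j) fitted as an intercept-free
# logistic regression on one-hot examinee and (negated) one-hot item features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_examinees, m_items = 50, 20
ability = rng.normal(size=n_examinees)
difficulty = rng.normal(size=m_items)

# Simulate the response matrix (1 = correct answer).
logits = ability[:, None] - difficulty[None, :]
responses = (rng.random((n_examinees, m_items)) < 1 / (1 + np.exp(-logits))).astype(int)

# One feature row per (examinee, item) response.
rows, labels = [], []
for i in range(n_examinees):
    for j in range(m_items):
        x = np.zeros(n_examinees + m_items)
        x[i] = 1.0                       # examinee indicator -> ability
        x[n_examinees + j] = -1.0        # item indicator     -> difficulty
        rows.append(x)
        labels.append(responses[i, j])

clf = LogisticRegression(fit_intercept=False, C=10.0, max_iter=1000)
clf.fit(np.array(rows), np.array(labels))

est_ability = clf.coef_[0][:n_examinees]
est_difficulty = clf.coef_[0][n_examinees:]
print("ability corr:", round(float(np.corrcoef(ability, est_ability)[0, 1]), 2))
print("difficulty corr:", round(float(np.corrcoef(difficulty, est_difficulty)[0, 1]), 2))
```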
arXiv Detail & Related papers (2024-03-01T17:12:53Z) - USB: A Unified Summarization Benchmark Across Tasks and Domains [68.82726887802856]
We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks.
We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models.
arXiv Detail & Related papers (2023-05-23T17:39:54Z) - More Communication Does Not Result in Smaller Generalization Error in Federated Learning [9.00236182523638]
We study the generalization error of statistical learning models in a Federated Learning setting.
We consider multiple (say $R \in \mathbb{N}^*$) rounds of model aggregation and study the effect of $R$ on the generalization error of the final aggregated model.
arXiv Detail & Related papers (2023-04-24T15:56:11Z) - You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM [65.74934004876914]
Retrieval-enhanced language models (LMs) condition their predictions on text retrieved from large external datastores.
One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model.
We empirically measure the effectiveness of our approach on two English language modeling datasets.
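For intuition about the interpolation described above, here is a toy sketch (not the authors' released code): distances to the $k$ retrieved datastore entries are turned into a distribution over next tokens and mixed with the base LM's distribution using a fixed weight.

```python
# Toy sketch of kNN-LM-style interpolation: retrieve k nearest stored
# (context vector -> next token) pairs, convert distances to a vocabulary
# distribution, and mix with the base LM's distribution using weight lam.
import numpy as np

rng = np.random.default_rng(3)
vocab_size, dim, store_size, k, lam = 100, 16, 5000, 8, 0.25

# Datastore: hidden states from a training corpus, each paired with the token
# that actually followed it.
keys = rng.standard_normal((store_size, dim))
values = rng.integers(0, vocab_size, size=store_size)

def knn_lm_next_token_probs(query, p_lm):
    # Nearest neighbours by L2 distance (a real system would use an ANN index).
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances, aggregated per retrieved next token.
    weights = np.exp(-dists[nearest])
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)
    # Interpolate with the base LM's distribution.
    return lam * p_knn + (1 - lam) * p_lm

query = rng.standard_normal(dim)                  # current context's hidden state
p_lm = rng.dirichlet(np.ones(vocab_size))         # stand-in base-LM distribution
p = knn_lm_next_token_probs(query, p_lm)
print("sums to 1:", np.isclose(p.sum(), 1.0), "argmax token:", int(p.argmax()))
```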
arXiv Detail & Related papers (2022-10-28T02:57:40Z) - Deconstructing Distributions: A Pointwise Framework of Learning [15.517383696434162]
We study a point's profile: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point.
We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution.
arXiv Detail & Related papers (2022-02-20T23:25:28Z) - How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z) - Best-item Learning in Random Utility Models with Subset Choices [40.17224226373741]
We consider the problem of PAC learning the most valuable item from a pool of $n$ items using sequential, adaptively chosen plays of subsets of $k$ items.
We identify a new property of such a RUM, termed the minimum advantage, that helps in characterizing the complexity of separating pairs of items.
We give a learning algorithm for general RUMs, based on pairwise relative counts of items and hierarchical elimination, along with a new PAC sample complexity guarantee.
arXiv Detail & Related papers (2020-02-19T03:57:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.