Related papers: How to Select Datapoints for Efficient Human Evaluation of NLG Models?

URL: http://arxiv.org/abs/2501.18251v1
Date: Thu, 30 Jan 2025 10:33:26 GMT
Title: How to Select Datapoints for Efficient Human Evaluation of NLG Models?
Authors: Vilém Zouhar, Peng Cui, Mrinmaya Sachan
Abstract summary: We develop a suite of selectors to get the most informative datapoints for human evaluation. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
Score: 57.60407340254572
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human evaluation is the gold standard for evaluating text generation models. It is also expensive, and to fit budgetary constraints, a random subset of the test data is often chosen in practice. The randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop a suite of selectors to get the most informative datapoints for human evaluation while taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts. We demonstrate the efficacy of our selectors in two common NLG tasks, machine translation and summarization, and show that up to only ~50% of the test data is needed to produce the same evaluation result as the entire data. Our implementations are published in the subset2evaluate package.
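
As a rough illustration of the variance-based idea mentioned in the abstract, the sketch below ranks test items by how much an automated metric disagrees across models and keeps the top items within a budget. It is a minimal sketch under stated assumptions, not the paper's implementation and not the subset2evaluate API: the function name select_by_metric_variance, the dictionary layout, and the toy scores are all hypothetical.

```python
from statistics import pvariance


def select_by_metric_variance(metric_scores, budget):
    """Return the `budget` item ids whose metric scores vary most across models.

    metric_scores: dict mapping item id -> list of automated metric scores,
    one score per model output (hypothetical layout, assumed for illustration).
    """
    # Items on which models receive very different scores are assumed to be
    # more useful for telling the models apart.
    ranked = sorted(
        metric_scores,
        key=lambda item: pvariance(metric_scores[item]),
        reverse=True,
    )
    return ranked[:budget]


# Toy data (made up): three items, each scored for three models.
scores = {
    "item_a": [0.80, 0.81, 0.79],  # models nearly tie
    "item_b": [0.20, 0.90, 0.55],  # models differ strongly
    "item_c": [0.50, 0.60, 0.40],  # models differ moderately
}
selected = select_by_metric_variance(scores, budget=2)
print(selected)
```

Only the selected items would then be sent for human evaluation; the cost-aware and source-based variants described in the abstract are not covered by this sketch.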

Related papers

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings [23.9553588103042]
We propose an item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves. We show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation.
arXiv Detail & Related papers (2025-10-30T11:28:58Z)
Bayesian information theoretic model-averaging stochastic item selection for computer adaptive testing: compromise-free item exposure [0.9208007322096533]
We formulate the optimization problem for Computer Adaptive Testing (CAT) in terms of Bayesian information theory. We find that our selector has superior properties in terms of both item exposure and test accuracy/efficiency.
arXiv Detail & Related papers (2025-04-22T02:45:16Z)
Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [50.492124556982674]
This paper introduces a novel choice-based sample selection framework. It shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples. We validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.
arXiv Detail & Related papers (2025-03-04T07:32:41Z)
CritiQ: Mining Data Quality Criteria from Human Preferences [70.35346554179036]
We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality. CritiQ Flow employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We demonstrate the effectiveness of our method in the code, math, and logic domains.
arXiv Detail & Related papers (2025-02-26T16:33:41Z)
DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets. Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
How Many Ratings per Item are Necessary for Reliable Significance Testing? [7.777020199676859]
Most approaches to machine learning evaluation assume that machine and human responses are repeatable enough to be measured against data with unitary authoritative, "gold standard" responses. We introduce methods for determining whether an (existing or planned) evaluation dataset has enough responses per item to reliably compare the performance of one model to another.
arXiv Detail & Related papers (2024-12-04T02:31:28Z)
SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation [75.56845750400116]
Disaggregated evaluation -- estimation of performance of a machine learning model on different subpopulations -- is a core task when assessing performance and group-fairness of AI systems. We develop SureMap that has high estimation accuracy for both multi-task and single-task disaggregated evaluations of blackbox models. Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein's unbiased risk estimate (SURE).
arXiv Detail & Related papers (2024-11-14T17:53:35Z)
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality. We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations. We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation [16.889939234103153]
We propose to variabilize benchmarks and evaluate language models dynamically. Specifically, we extract variables from each test case and define a value range for each variable. For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time.
arXiv Detail & Related papers (2024-06-25T16:13:53Z)
A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation [17.351089059392674]
We propose a framework for model evaluation that includes stratification, sampling, and estimation components. We show that stratification via k-means clustering based on accurate predictions of model performance yields efficient estimators. We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than the traditional estimates.
arXiv Detail & Related papers (2024-06-11T14:49:04Z)
DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality. We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
Evaluation of human-model prediction difference on the Internet Scale of Data [32.7296837724399]
Evaluating models on datasets often fails to capture their behavior when faced with unexpected and diverse types of inputs. We propose OmniInput, a novel approach to evaluate and compare NNs by the PR of an input space.
arXiv Detail & Related papers (2023-12-06T04:53:12Z)
ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain. Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples. In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem. We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z)
Statistical Model Criticism of Variational Auto-Encoders [15.005894753472894]
We propose a framework for the statistical evaluation of variational auto-encoders (VAEs). We test two instances of this framework in the context of modelling images of handwritten digits and a corpus of English text.
arXiv Detail & Related papers (2022-04-06T18:19:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.