How to Select Datapoints for Efficient Human Evaluation of NLG Models?
- URL: http://arxiv.org/abs/2501.18251v1
- Date: Thu, 30 Jan 2025 10:33:26 GMT
- Title: How to Select Datapoints for Efficient Human Evaluation of NLG Models?
- Authors: Vilém Zouhar, Peng Cui, Mrinmaya Sachan,
- Abstract summary: We develop a suite of selectors to get the most informative datapoints for human evaluation.
We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection.
In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
- Score: 57.60407340254572
- License:
- Abstract: Human evaluation is the gold-standard for evaluating text generation models. It is also expensive, and to fit budgetary constraints, a random subset of the test data is often chosen in practice. The randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop a suite of selectors to get the most informative datapoints for human evaluation while taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts. We demonstrate the efficacy of our selectors in two common NLG tasks, machine translation and summarization, and show that up to only ~50% of the test data is needed to produce the same evaluation result as the entire data. Our implementations are published in the subset2evaluate package.
Related papers
- How Many Ratings per Item are Necessary for Reliable Significance Testing? [7.777020199676859]
Most approaches to machine learning evaluation assume that machine and human responses are repeatable enough to be measured against data with unitary authoritative, "gold standard" responses.
We introduce methods for determining whether an (existing or planned) evaluation dataset has enough responses per item to reliably compare the performance of one model to another.
arXiv Detail & Related papers (2024-12-04T02:31:28Z) - SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation [75.56845750400116]
Disaggregated evaluation -- estimation of performance of a machine learning model on different subpopulations -- is a core task when assessing performance and group-fairness of AI systems.
We develop SureMap that has high estimation accuracy for both multi-task and single-task disaggregated evaluations of blackbox models.
Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein's unbiased risk estimate (SURE)
arXiv Detail & Related papers (2024-11-14T17:53:35Z) - Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z) - Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z) - VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation [16.889939234103153]
We propose to variabilize benchmarks and evaluate language models dynamically.
Specifically, we extract variables from each test case and define a value range for each variable.
For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time.
arXiv Detail & Related papers (2024-06-25T16:13:53Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z) - Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z) - Statistical Model Criticism of Variational Auto-Encoders [15.005894753472894]
We propose a framework for the statistical evaluation of variational auto-encoders (VAEs)
We test two instances of this framework in the context of modelling images of handwritten digits and a corpus of English text.
arXiv Detail & Related papers (2022-04-06T18:19:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.