Active Evaluation: Efficient NLG Evaluation with Few Pairwise
Comparisons
- URL: http://arxiv.org/abs/2203.06063v1
- Date: Fri, 11 Mar 2022 16:39:15 GMT
- Title: Active Evaluation: Efficient NLG Evaluation with Few Pairwise
Comparisons
- Authors: Akash Kumar Mohankumar, Mitesh M. Khapra
- Abstract summary: We introduce Active Evaluation, a framework to efficiently identify the top-ranked system.
We show that the number of human annotations can be reduced by 80%.
We also propose model-based dueling bandit algorithms which combine automatic evaluation metrics with human evaluations.
- Score: 19.547476809031764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown the advantages of evaluating NLG systems using
pairwise comparisons as opposed to direct assessment. Given $k$ systems, a
naive approach for identifying the top-ranked system would be to uniformly
obtain pairwise comparisons from all ${k \choose 2}$ pairs of systems. However,
this can be very expensive as the number of human annotations required would
grow quadratically with $k$. In this work, we introduce Active Evaluation, a
framework to efficiently identify the top-ranked system by actively choosing
system pairs for comparison using dueling bandit algorithms. We perform
extensive experiments with 13 dueling bandit algorithms on 13 NLG evaluation
datasets spanning 5 tasks and show that the number of human annotations can be
reduced by 80%. To further reduce the number of human annotations, we propose
model-based dueling bandit algorithms which combine automatic evaluation
metrics with human evaluations. Specifically, we eliminate sub-optimal systems
even before the human annotation process and perform human evaluations only on
test examples where the automatic metric is highly uncertain. This reduces the
number of human annotations required further by 89%. In effect, we show that
identifying the top-ranked system requires only a few hundred human
annotations, which grow linearly with $k$. Lastly, we provide practical
recommendations and best practices to identify the top-ranked system
efficiently. Our code has been made publicly available at
https://github.com/akashkm99/duelnlg
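To make the framework concrete, the following is a minimal, hypothetical sketch of the control flow the abstract describes: a dueling-bandit-style loop that actively chooses which pair of systems to compare next (here via a simple upper-confidence-bound heuristic) and falls back to a human judgment only when a stand-in automatic metric is uncertain. The system names, the metric, the UCB heuristic, and the uncertainty threshold are illustrative assumptions; the paper's actual algorithms are in the duelnlg repository linked above.

```python
import math
import random

# Hypothetical placeholders: k NLG systems to be ranked by pairwise comparison.
SYSTEMS = ["sys_A", "sys_B", "sys_C", "sys_D"]

def automatic_metric_prefers(a, b, example):
    """Stand-in automatic metric: returns (preferred_system, confidence in [0, 1])."""
    score = random.random()
    return (a if score > 0.5 else b), abs(score - 0.5) * 2

def human_prefers(a, b, example):
    """Stand-in for an expensive human pairwise judgment."""
    return random.choice([a, b])

def top_system(systems, budget=300, uncertainty_threshold=0.3):
    wins = {s: 0 for s in systems}
    plays = {s: 1e-9 for s in systems}   # avoid division by zero before first play
    human_queries = 0
    for t in range(1, budget + 1):
        # UCB-style scores: favour systems that win often or have been compared rarely.
        ucb = {s: wins[s] / plays[s] + math.sqrt(2 * math.log(t) / plays[s])
               for s in systems}
        ranked = sorted(systems, key=ucb.get, reverse=True)
        a, b = ranked[0], ranked[1]        # actively chosen pair of systems
        example = random.randrange(1000)   # illustrative test example id
        winner, confidence = automatic_metric_prefers(a, b, example)
        if confidence < uncertainty_threshold:
            winner = human_prefers(a, b, example)   # query a human only when uncertain
            human_queries += 1
        wins[winner] += 1
        plays[a] += 1
        plays[b] += 1
    best = max(systems, key=lambda s: wins[s] / plays[s])
    return best, human_queries

if __name__ == "__main__":
    best, queries = top_system(SYSTEMS)
    print(f"estimated top system: {best} (human queries used: {queries})")
```

The point of the sketch is only the control flow: comparisons concentrate on the strongest candidates, and human queries are spent where the automatic metric is least certain, which is how an annotation cost that grows roughly linearly with $k$ becomes plausible.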
Related papers
- Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling [50.08315607506652]
We propose a Constrained Active Sampling Framework (CASF) for reliable human judgment.
Experimental results show that CASF achieves 93.18% top-ranked system recognition accuracy.
arXiv Detail & Related papers (2024-06-12T07:44:36Z)
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models [35.17291316942284]
We propose a novel zero-shot document ranking approach based on Large Language Models (LLMs): the Setwise prompting approach.
Our approach complements existing prompting approaches for LLM-based zero-shot ranking: Pointwise, Pairwise, and Listwise.
arXiv Detail & Related papers (2023-10-14T05:20:02Z)
- When Are Two Lists Better than One?: Benefits and Harms in Joint Decision-making [19.605382256630534]
We analyze a type of human-algorithm collaboration where the algorithm has access to a set of $n$ items, and presents a subset of size $k$ to the human.
This scenario could model content recommendation, route planning, or any type of labeling task.
We show that for multiple noise models, it is optimal to set $k \in [2, n-1]$; that is, there are strict benefits to collaborating, even when the human and algorithm have equal accuracy.
arXiv Detail & Related papers (2023-08-22T18:16:40Z)
- Crowdsourcing subjective annotations using pairwise comparisons reduces bias and error compared to the majority-vote method [0.0]
We introduce a theoretical framework for understanding how random error and measurement bias enter into crowdsourced annotations of subjective constructs.
We then propose a pipeline that combines pairwise comparison labelling with Elo scoring, and demonstrate that it outperforms the ubiquitous majority-voting method in reducing both types of measurement error (a minimal Elo-update sketch appears after this list).
arXiv Detail & Related papers (2023-05-31T17:14:12Z)
- Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values [73.82043713141142]
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14T09:10:49Z)
- GHRS: Graph-based Hybrid Recommendation System with Application to Movie Recommendation [0.0]
We propose a recommender system built on a graph-based model that captures the similarity of users' ratings.
By utilizing the advantages of Autoencoder feature extraction, we extract new features based on all combined attributes.
The experimental results on the MovieLens dataset show that the proposed algorithm outperforms many existing recommendation algorithms on recommendation accuracy.
arXiv Detail & Related papers (2021-11-06T10:47:45Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking [74.46448041224247]
We introduce the novel Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy used to log data.
LogOpt turns the counterfactual approach - which is indifferent to the logging policy - into an online approach, where the algorithm decides what rankings to display.
We prove that, as an online evaluation method, LogOpt is unbiased w.r.t. position and item-selection bias, unlike existing interleaving methods.
arXiv Detail & Related papers (2020-07-24T18:05:58Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
- Ranking a set of objects: a graph based least-square approach [70.7866286425868]
We consider the problem of ranking $N$ objects starting from a set of noisy pairwise comparisons provided by a crowd of equally reliable workers.
We propose a class of non-adaptive ranking algorithms that rely on a least-squares intrinsic optimization criterion for the estimation of qualities.
arXiv Detail & Related papers (2020-02-26T16:19:09Z)
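For the crowdsourcing paper above that pairs pairwise-comparison labelling with Elo scoring, here is a minimal, self-contained sketch of how such a pipeline can turn raw pairwise judgments into a ranking. The K-factor, initial rating, and example judgments are illustrative assumptions, not values taken from that paper.

```python
from collections import defaultdict

def elo_update(rating_a, rating_b, a_wins, k_factor=32.0):
    """Standard Elo update after one pairwise comparison between items a and b."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k_factor * (score_a - expected_a)
    new_b = rating_b + k_factor * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

def rank_from_comparisons(comparisons, initial_rating=1500.0):
    """Turn a list of (item_a, item_b, a_wins) annotations into a ranked list."""
    ratings = defaultdict(lambda: initial_rating)
    for a, b, a_wins in comparisons:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Illustrative annotations: each tuple is one crowdworker's pairwise judgment.
    judgments = [("sys_A", "sys_B", True), ("sys_B", "sys_C", True),
                 ("sys_A", "sys_C", True), ("sys_C", "sys_A", False)]
    print(rank_from_comparisons(judgments))
```

Because each judgment only updates the two items involved, the ranking can be maintained online as annotations arrive.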