Active Evaluation: Efficient NLG Evaluation with Few Pairwise
Comparisons
- URL: http://arxiv.org/abs/2203.06063v1
- Date: Fri, 11 Mar 2022 16:39:15 GMT
- Title: Active Evaluation: Efficient NLG Evaluation with Few Pairwise
Comparisons
- Authors: Akash Kumar Mohankumar, Mitesh M. Khapra
- Abstract summary: We introduce Active Evaluation, a framework to efficiently identify the top-ranked system.
We show that the number of human annotations can be reduced by 80%.
We also propose model-based dueling bandit algorithms which combine automatic evaluation metrics with human evaluations.
- Score: 19.547476809031764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown the advantages of evaluating NLG systems using
pairwise comparisons as opposed to direct assessment. Given $k$ systems, a
naive approach for identifying the top-ranked system would be to uniformly
obtain pairwise comparisons from all ${k \choose 2}$ pairs of systems. However,
this can be very expensive as the number of human annotations required would
grow quadratically with $k$. In this work, we introduce Active Evaluation, a
framework to efficiently identify the top-ranked system by actively choosing
system pairs for comparison using dueling bandit algorithms. We perform
extensive experiments with 13 dueling bandit algorithms on 13 NLG evaluation
datasets spanning 5 tasks and show that the number of human annotations can be
reduced by 80%. To further reduce the number of human annotations, we propose
model-based dueling bandit algorithms which combine automatic evaluation
metrics with human evaluations. Specifically, we eliminate sub-optimal systems
even before the human annotation process and perform human evaluations only on
test examples where the automatic metric is highly uncertain. This reduces the
number of human annotations required further by 89%. In effect, we show that
identifying the top-ranked system requires only a few hundred human
annotations, which grow linearly with $k$. Lastly, we provide practical
recommendations and best practices to identify the top-ranked system
efficiently. Our code has been made publicly available at
https://github.com/akashkm99/duelnlg
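To make the framework concrete, the following is a minimal, hypothetical sketch of the control flow the abstract describes: a dueling-bandit-style loop that actively chooses which pair of systems to compare next (here via a simple upper-confidence-bound heuristic) and falls back to a human judgment only when a stand-in automatic metric is uncertain. The system names, the metric, the UCB heuristic, and the uncertainty threshold are illustrative assumptions; the paper's actual algorithms are in the duelnlg repository linked above.

```python
import math
import random

# Hypothetical placeholders: k NLG systems to be ranked by pairwise comparison.
SYSTEMS = ["sys_A", "sys_B", "sys_C", "sys_D"]

def automatic_metric_prefers(a, b, example):
    """Stand-in automatic metric: returns (preferred_system, confidence in [0, 1])."""
    score = random.random()
    return (a if score > 0.5 else b), abs(score - 0.5) * 2

def human_prefers(a, b, example):
    """Stand-in for an expensive human pairwise judgment."""
    return random.choice([a, b])

def top_system(systems, budget=300, uncertainty_threshold=0.3):
    wins = {s: 0 for s in systems}
    plays = {s: 1e-9 for s in systems}   # avoid division by zero before first play
    human_queries = 0
    for t in range(1, budget + 1):
        # UCB-style scores: favour systems that win often or have been compared rarely.
        ucb = {s: wins[s] / plays[s] + math.sqrt(2 * math.log(t) / plays[s])
               for s in systems}
        ranked = sorted(systems, key=ucb.get, reverse=True)
        a, b = ranked[0], ranked[1]        # actively chosen pair of systems
        example = random.randrange(1000)   # illustrative test example id
        winner, confidence = automatic_metric_prefers(a, b, example)
        if confidence < uncertainty_threshold:
            winner = human_prefers(a, b, example)   # query a human only when uncertain
            human_queries += 1
        wins[winner] += 1
        plays[a] += 1
        plays[b] += 1
    best = max(systems, key=lambda s: wins[s] / plays[s])
    return best, human_queries

if __name__ == "__main__":
    best, queries = top_system(SYSTEMS)
    print(f"estimated top system: {best} (human queries used: {queries})")
```

The point of the sketch is only the control flow: comparisons concentrate on the strongest candidates, and human queries are spent where the automatic metric is least certain, which is how an annotation cost that grows roughly linearly with $k$ becomes plausible.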
Related papers
- Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling [50.08315607506652]
We propose a Constrained Active Sampling Framework (CASF) for reliable human judgment.
Experimental results show that CASF achieves 93.18% top-ranked system recognition accuracy.
arXiv Detail & Related papers (2024-06-12T07:44:36Z)
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models [35.17291316942284]
We propose a novel zero-shot document ranking approach based on Large Language Models (LLMs): the Setwise prompting approach.
Our approach complements existing prompting approaches for LLM-based zero-shot ranking: Pointwise, Pairwise, and Listwise.
arXiv Detail & Related papers (2023-10-14T05:20:02Z)
- When Are Two Lists Better than One?: Benefits and Harms in Joint Decision-making [19.605382256630534]
We analyze a type of human-algorithm collaboration where the algorithm has access to a set of $n$ items, and presents a subset of size $k$ to the human.
This scenario could model content recommendation, route planning, or any type of labeling task.
We show that for multiple noise models, it is optimal to set $k \in [2, n-1]$; that is, there are strict benefits to collaborating, even when the human and algorithm have equal accuracy.
arXiv Detail & Related papers (2023-08-22T18:16:40Z)
- Crowdsourcing subjective annotations using pairwise comparisons reduces bias and error compared to the majority-vote method [0.0]
We introduce a theoretical framework for understanding how random error and measurement bias enter into crowdsourced annotations of subjective constructs.
We then propose a pipeline that combines pairwise comparison labelling with Elo scoring, and demonstrate that it outperforms the ubiquitous majority-voting method in reducing both types of measurement error (a minimal Elo-update sketch appears after this list).
arXiv Detail & Related papers (2023-05-31T17:14:12Z)
- Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values [73.82043713141142]
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14T09:10:49Z)
- GHRS: Graph-based Hybrid Recommendation System with Application to Movie Recommendation [0.0]
We propose a recommender system built on a graph-based model that captures the similarity of users' ratings.
By utilizing the advantages of Autoencoder feature extraction, we extract new features based on all combined attributes.
The experimental results on the MovieLens dataset show that the proposed algorithm outperforms many existing recommendation algorithms on recommendation accuracy.
arXiv Detail & Related papers (2021-11-06T10:47:45Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking [74.46448041224247]
We introduce the novel Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy used to log data.
LogOpt turns the counterfactual approach - which is indifferent to the logging policy - into an online approach, where the algorithm decides what rankings to display.
We prove that, as an online evaluation method, LogOpt is unbiased w.r.t. position and item-selection bias, unlike existing interleaving methods.
arXiv Detail & Related papers (2020-07-24T18:05:58Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
- Ranking a set of objects: a graph based least-square approach [70.7866286425868]
We consider the problem of ranking $N$ objects starting from a set of noisy pairwise comparisons provided by a crowd of equally reliable workers.
We propose a class of non-adaptive ranking algorithms that rely on a least-squares intrinsic optimization criterion for the estimation of qualities.
arXiv Detail & Related papers (2020-02-26T16:19:09Z)
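For the crowdsourcing paper above that pairs pairwise-comparison labelling with Elo scoring, here is a minimal, self-contained sketch of how such a pipeline can turn raw pairwise judgments into a ranking. The K-factor, initial rating, and example judgments are illustrative assumptions, not values taken from that paper.

```python
from collections import defaultdict

def elo_update(rating_a, rating_b, a_wins, k_factor=32.0):
    """Standard Elo update after one pairwise comparison between items a and b."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k_factor * (score_a - expected_a)
    new_b = rating_b + k_factor * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

def rank_from_comparisons(comparisons, initial_rating=1500.0):
    """Turn a list of (item_a, item_b, a_wins) annotations into a ranked list."""
    ratings = defaultdict(lambda: initial_rating)
    for a, b, a_wins in comparisons:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Illustrative annotations: each tuple is one crowdworker's pairwise judgment.
    judgments = [("sys_A", "sys_B", True), ("sys_B", "sys_C", True),
                 ("sys_A", "sys_C", True), ("sys_C", "sys_A", False)]
    print(rank_from_comparisons(judgments))
```

Because each judgment only updates the two items involved, the ranking can be maintained online as annotations arrive.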