PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems
- URL: http://arxiv.org/abs/2004.02399v1
- Date: Mon, 6 Apr 2020 04:36:33 GMT
- Title: PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems
- Authors: Tian Lan, Xian-Ling Mao, Wei Wei, Xiaoyan Gao, Heyan Huang
- Abstract summary: There are three kinds of automatic methods to evaluate open-domain generative dialogue systems.
Due to the lack of a systematic comparison, it is not clear which kind of metric is most effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
- Score: 48.99561874529323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-domain generative dialogue systems have attracted considerable attention
over the past few years. Currently, how to evaluate them automatically remains a
challenging problem. As far as we know, there are three kinds of automatic methods
to evaluate open-domain generative dialogue systems: (1) Word-overlap-based metrics;
(2) Embedding-based metrics; (3) Learning-based metrics. Due to the lack of a
systematic comparison, it is not clear which kind of metric is most effective. In
this paper, we first systematically measure all three kinds of automatic evaluation
metrics under the same experimental setting to determine which kind is best. Through
extensive experiments, we demonstrate that learning-based metrics are the most
effective evaluation metrics for open-domain generative dialogue systems.
Moreover, we observe that nearly all learning-based metrics depend on a
negative sampling mechanism, which yields an extremely imbalanced and
low-quality dataset for training a score model. To address this issue, we
propose PONE, a novel and feasible learning-based metric that significantly
improves the correlation with human judgments by using augmented POsitive
samples and valuable NEgative samples. Extensive experiments demonstrate that
our proposed evaluation method significantly outperforms state-of-the-art
learning-based evaluation methods, with an average correlation improvement of
13.18%. In addition, we have publicly released the code of our proposed method
and the state-of-the-art baselines.
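
To make the critique of negative sampling concrete, below is a minimal Python sketch (not the authors' released code) of how a learning-based dialogue metric is typically trained and meta-evaluated: positive pairs come from ground-truth responses, negatives are drawn at random from unrelated dialogues, and the resulting automatic scores are correlated with human judgments. All function and variable names here are illustrative assumptions, not identifiers from the PONE implementation.

```python
# Minimal sketch (assumed, not from the PONE release) of the standard
# negative-sampling setup for a learning-based dialogue metric, plus the
# correlation-based meta-evaluation against human judgments.
import random
from scipy.stats import pearsonr, spearmanr


def build_training_pairs(dialogues):
    """Build (context, response, label) pairs.

    Positives pair a context with its ground-truth response; negatives pair
    it with a randomly chosen response from another dialogue -- the naive
    mechanism the paper argues yields imbalanced, low-quality training data.
    """
    all_responses = [resp for _, resp in dialogues]
    pairs = []
    for context, response in dialogues:
        pairs.append((context, response, 1))                 # positive pair
        negative = random.choice(
            [r for r in all_responses if r != response])     # random negative
        pairs.append((context, negative, 0))
    return pairs


def meta_evaluate(metric_scores, human_scores):
    """Correlate automatic scores with human judgments (Pearson/Spearman),
    the protocol used to compare word-overlap, embedding-based and
    learning-based metrics."""
    return {
        "pearson": pearsonr(metric_scores, human_scores)[0],
        "spearman": spearmanr(metric_scores, human_scores)[0],
    }


if __name__ == "__main__":
    dialogues = [
        ("how are you?", "i am fine, thanks."),
        ("what time is it?", "it is almost noon."),
    ]
    print(build_training_pairs(dialogues))
    # Toy metric scores vs. toy human ratings for three responses.
    print(meta_evaluate([0.2, 0.8, 0.4], [1, 5, 2]))
```

Per the abstract, PONE's contribution is to replace this naive construction with augmented positive samples and carefully selected ("valuable") negative samples; the random-negative step above marks where that substitution would occur.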
Related papers
- How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? [3.1706553206969925]
We perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks.
We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent.
Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
arXiv Detail & Related papers (2024-02-16T15:48:33Z) - Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z) - Using Active Learning Methods to Strategically Select Essays for
Automated Scoring [0.0]
The purpose of this study is to describe and evaluate three active learning methods.
The three active learning methods are the uncertainty-based, the topological-based, and the hybrid method.
All three methods produced strong results, with the topological-based method producing the most efficient classification.
arXiv Detail & Related papers (2023-01-02T12:46:10Z) - The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z) - Re-Examining System-Level Correlations of Automatic Summarization
Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine
Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.