Achieving Reliable Human Assessment of Open-Domain Dialogue Systems
- URL: http://arxiv.org/abs/2203.05899v1
- Date: Fri, 11 Mar 2022 13:08:39 GMT
- Title: Achieving Reliable Human Assessment of Open-Domain Dialogue Systems
- Authors: Tianbo Ji, Yvette Graham, Gareth J. F. Jones, Chenyang Lyu, Qun Liu
- Abstract summary: We present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost.
Due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation.
- Score: 24.478609926760587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation of open-domain dialogue systems is highly challenging and
development of better techniques is highlighted time and again as desperately
needed. Despite substantial efforts to carry out reliable live evaluation of
systems in recent competitions, annotations have been abandoned and reported as
too unreliable to yield sensible results. This is a serious problem since
automatic metrics are not known to provide a good indication of what may or may
not be a high-quality conversation. Answering the distress call of competitions
that have emphasized the urgent need for better evaluation techniques in
dialogue, we present the successful development of human evaluation that is
highly reliable while still remaining feasible and low cost. Self-replication
experiments reveal almost perfectly repeatable results with a correlation of
$r=0.969$. Furthermore, due to the lack of appropriate methods of statistical
significance testing, the likelihood of potential improvements to systems
occurring due to chance is rarely taken into account in dialogue evaluation,
and the evaluation we propose facilitates application of standard tests. Since
we have developed a highly reliable evaluation method, new insights into system
performance can be revealed. We therefore include a comparison of
state-of-the-art models (i) with and without personas, to measure the
contribution of personas to conversation quality, as well as (ii) prescribed
versus freely chosen topics. Interestingly, with respect to personas, results
indicate that, contrary to expectations, personas do not positively contribute
to conversation quality.
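Because the proposed evaluation produces repeatable per-system scores, standard correlation and significance tests apply directly. As a rough illustration (not code from the paper, and with invented scores), the sketch below checks self-replication agreement with Pearson's r and tests a pairwise system difference with a Mann-Whitney U test:
```python
# Illustrative sketch only: systems and ratings are invented, not taken from
# the paper; it shows how standard tests apply once human scores are reliable.
from scipy.stats import pearsonr, mannwhitneyu

# Hypothetical mean quality scores per system from two self-replication runs.
run_a = [72.4, 68.1, 65.9, 61.3, 55.0]
run_b = [71.8, 69.0, 64.7, 62.0, 54.2]
r, _ = pearsonr(run_a, run_b)
print(f"self-replication correlation: r = {r:.3f}")

# Hypothetical per-conversation ratings for two systems; a rank-based test
# asks whether the apparent improvement could plausibly be due to chance.
sys_x = [70, 82, 65, 77, 90, 68, 74, 81, 69, 73]
sys_y = [62, 75, 60, 71, 84, 59, 70, 73, 61, 66]
stat, p = mannwhitneyu(sys_x, sys_y, alternative="two-sided")
print(f"Mann-Whitney U = {stat}, p = {p:.3f}")
```
The paper does not prescribe these particular tests; the point is only that stable per-system scores make such standard statistical machinery usable.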
Related papers
- MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation [0.4857223913212445]
We propose a novel system, MIRROR, to automate the evaluation process for questions generated by automated question generation systems.
We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR.
arXiv Detail & Related papers (2024-10-16T12:24:42Z)
- PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - Don't Forget Your ABC's: Evaluating the State-of-the-Art in
Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z) - Automatic Evaluation and Moderation of Open-domain Dialogue Systems [59.305712262126264]
A long standing challenge that bothers the researchers is the lack of effective automatic evaluation metrics.
This paper describes the data, baselines and results obtained for the Track 5 at the Dialogue System Technology Challenge 10 (DSTC10)
arXiv Detail & Related papers (2021-11-03T10:08:05Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine
Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z) - Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of
Current Evaluation Protocols [17.14709845342071]
The current state of affairs suggests various evaluation protocols to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems.
arXiv Detail & Related papers (2020-06-10T23:29:05Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z) - PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods to evaluate the open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metrics are more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)