How to Evaluate Your Dialogue Models: A Review of Approaches
- URL: http://arxiv.org/abs/2108.01369v1
- Date: Tue, 3 Aug 2021 08:52:33 GMT
- Title: How to Evaluate Your Dialogue Models: A Review of Approaches
- Authors: Xinmeng Li, Wansen Wu, Long Qin and Quanjun Yin
- Abstract summary: We are the first to divide the evaluation methods into three classes, i.e., automatic evaluation, human-involved evaluation, and user-simulator-based evaluation.
Benchmarks suitable for the evaluation of dialogue techniques are also discussed in detail.
- Score: 2.7834038784275403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the quality of a dialogue system is an understudied problem. The
recent evolution of evaluation methods motivated this survey, which seeks an explicit and
comprehensive analysis of the existing methods. We are the first to divide the evaluation
methods into three classes, i.e., automatic evaluation, human-involved evaluation, and
user-simulator-based evaluation. Each class is then covered with its main features and the
related evaluation metrics. Benchmarks suitable for the evaluation of dialogue techniques
are also discussed in detail. Finally, some open issues are pointed out to bring dialogue
evaluation methods to a new frontier.
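To make the automatic-evaluation class concrete, here is a minimal sketch (not taken from the surveyed paper itself) of a reference-based metric: sentence-level BLEU between a generated response and a reference response, computed with NLTK. The example responses are invented for illustration.

```python
# Minimal sketch of reference-based automatic evaluation: sentence-level BLEU
# between a model response and a human reference, using NLTK.
# The example responses below are invented for illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am doing well thanks for asking".split()
hypothesis = "i am doing fine thanks".split()

# Smoothing avoids zero scores when higher-order n-grams do not overlap,
# which is common for short dialogue responses.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```

Word-overlap metrics such as BLEU are known to correlate weakly with human judgments of dialogue quality, which is one motivation for the learned metrics discussed in the related papers below.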
Related papers
- Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores [17.791039417061565]
We propose an evaluation approach where a turn is decomposed into nuggets (i.e., expressions associated with a dialogue act).
We demonstrate the potential effectiveness of our evaluation method through a case study.
arXiv Detail & Related papers (2023-09-30T15:14:50Z)
- Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and is compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics, each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue [15.31433922183745]
We propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for evaluating open-domain dialogue.
MME-CRS ranks first by a large margin on the final test data of the DSTC10 Track 5 Subtask 1 Automatic Open-domain Dialogue Evaluation Challenge.
arXiv Detail & Related papers (2022-06-19T13:43:59Z)
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006]
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it.
To utilize segment act flows, i.e., sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
arXiv Detail & Related papers (2022-02-14T11:37:20Z)
- MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation [66.60285024216573]
A dialogue evaluator is expected to conduct assessments across domains as well.
However, most state-of-the-art automatic dialogue evaluation metrics (ADMs) are not designed for multi-domain evaluation.
We are therefore motivated to design a general and robust framework, MDD-Eval, to address this problem.
arXiv Detail & Related papers (2021-12-14T07:01:20Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols [17.14709845342071]
A variety of evaluation protocols are currently in use to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems.
arXiv Detail & Related papers (2020-06-10T23:29:05Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments (a minimal sketch of such a correlation check follows this list).
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
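Several of the metrics above (FineD-Eval, MME-CRS, PONE, ENIGMA) are judged by how well their scores correlate with human ratings. The sketch below shows that standard meta-evaluation check, assuming a small set of paired human and metric scores; the numbers are invented for illustration.

```python
# Minimal sketch of the standard meta-evaluation for automatic dialogue
# metrics: correlation between metric scores and human ratings.
# The paired scores below are invented for illustration only.
from scipy.stats import pearsonr, spearmanr

human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5, 3.5]        # e.g., 1-5 Likert ratings
metric_scores = [0.71, 0.40, 0.55, 0.90, 0.22, 0.60]  # automatic metric outputs

pearson_r, _ = pearsonr(human_ratings, metric_scores)
spearman_rho, _ = spearmanr(human_ratings, metric_scores)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

Pearson correlation measures linear agreement with the human scores, while Spearman correlation only requires the metric to rank systems or responses in the same order as humans do.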
This list is automatically generated from the titles and abstracts of the papers in this site.