How to Evaluate Your Dialogue Models: A Review of Approaches
- URL: http://arxiv.org/abs/2108.01369v1
- Date: Tue, 3 Aug 2021 08:52:33 GMT
- Title: How to Evaluate Your Dialogue Models: A Review of Approaches
- Authors: Xinmeng Li, Wansen Wu, Long Qin and Quanjun Yin
- Abstract summary: We are the first to divide the evaluation methods into three classes, i.e., automatic evaluation, human-involved evaluation, and user-simulator-based evaluation.
Benchmarks suitable for the evaluation of dialogue techniques are also discussed in detail.
- Score: 2.7834038784275403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the quality of a dialogue system is an understudied problem. The
recent evolution of evaluation methods motivated this survey, which seeks an explicit and
comprehensive analysis of the existing methods. We are the first to divide the evaluation
methods into three classes, i.e., automatic evaluation, human-involved evaluation, and
user-simulator-based evaluation. Each class is then covered with its main features and the
related evaluation metrics. Benchmarks suitable for the evaluation of dialogue techniques
are also discussed in detail. Finally, some open issues are pointed out to bring dialogue
evaluation methods to a new frontier.
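To make the automatic-evaluation class concrete, here is a minimal sketch (not taken from the surveyed paper itself) of a reference-based metric: sentence-level BLEU between a generated response and a reference response, computed with NLTK. The example responses are invented for illustration.

```python
# Minimal sketch of reference-based automatic evaluation: sentence-level BLEU
# between a model response and a human reference, using NLTK.
# The example responses below are invented for illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am doing well thanks for asking".split()
hypothesis = "i am doing fine thanks".split()

# Smoothing avoids zero scores when higher-order n-grams do not overlap,
# which is common for short dialogue responses.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```

Word-overlap metrics such as BLEU are known to correlate weakly with human judgments of dialogue quality, which is one motivation for the learned metrics discussed in the related papers below.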
Related papers
- Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores [17.791039417061565]
We propose an evaluation approach where a turn is decomposed into nuggets (i.e., expressions associated with a dialogue act).
We demonstrate the potential effectiveness of our evaluation method through a case study.
arXiv Detail & Related papers (2023-09-30T15:14:50Z)
- Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and is compared with existing approaches.
arXiv Detail & Related papers (2022-12-18T22:07:55Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics, each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue [15.31433922183745]
We propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for evaluating open-domain dialogue.
MME-CRS ranks first by a large margin on the final test data of the DSTC10 Track 5 Subtask 1 Automatic Open-domain Dialogue Evaluation Challenge.
arXiv Detail & Related papers (2022-06-19T13:43:59Z)
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006]
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it.
To utilize segment act flows, i.e., sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
arXiv Detail & Related papers (2022-02-14T11:37:20Z)
- MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation [66.60285024216573]
A dialogue evaluator is expected to conduct assessments across domains as well.
However, most state-of-the-art automatic dialogue evaluation metrics (ADMs) are not designed for multi-domain evaluation.
We are therefore motivated to design a general and robust framework, MDD-Eval, to address this problem.
arXiv Detail & Related papers (2021-12-14T07:01:20Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols [17.14709845342071]
A variety of evaluation protocols are currently in use to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems.
arXiv Detail & Related papers (2020-06-10T23:29:05Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments (a minimal sketch of such a correlation check follows this list).
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
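Several of the metrics above (FineD-Eval, MME-CRS, PONE, ENIGMA) are judged by how well their scores correlate with human ratings. The sketch below shows that standard meta-evaluation check, assuming a small set of paired human and metric scores; the numbers are invented for illustration.

```python
# Minimal sketch of the standard meta-evaluation for automatic dialogue
# metrics: correlation between metric scores and human ratings.
# The paired scores below are invented for illustration only.
from scipy.stats import pearsonr, spearmanr

human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5, 3.5]        # e.g., 1-5 Likert ratings
metric_scores = [0.71, 0.40, 0.55, 0.90, 0.22, 0.60]  # automatic metric outputs

pearson_r, _ = pearsonr(human_ratings, metric_scores)
spearman_rho, _ = spearmanr(human_ratings, metric_scores)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

Pearson correlation measures linear agreement with the human scores, while Spearman correlation only requires the metric to rank systems or responses in the same order as humans do.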
This list is automatically generated from the titles and abstracts of the papers in this site.