Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of
Current Evaluation Protocols
- URL: http://arxiv.org/abs/2006.06110v1
- Date: Wed, 10 Jun 2020 23:29:05 GMT
- Title: Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of
Current Evaluation Protocols
- Authors: Sarah E. Finch and Jinho D. Choi
- Abstract summary: The current state of affairs suggests various evaluation protocols to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods for dialogue systems.
- Score: 17.14709845342071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As conversational AI-based dialogue management has increasingly become a
trending topic, the need for a standardized and reliable evaluation procedure
grows even more pressing. The current state of affairs suggests various
evaluation protocols to assess chat-oriented dialogue management systems,
rendering it difficult to conduct fair comparative studies across different
approaches and to gain an insightful understanding of their value. To foster this
research, a more robust evaluation protocol must be set in place. This paper
presents a comprehensive synthesis of both automated and human evaluation
methods for dialogue systems, identifying their shortcomings while accumulating
evidence towards the most effective evaluation dimensions. A total of 20 papers
from the last two years are surveyed to analyze three types of evaluation
protocols: automated, static, and interactive. Finally, the evaluation
dimensions used in these papers are compared against our expert evaluation on
the system-user dialogue data collected from the Alexa Prize 2020.
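Since the paper's comparison of evaluation dimensions ultimately rests on how well automated scores track expert judgments, the following is a minimal sketch of that kind of correlation check. The metric and rating values are made up, and this is not the paper's actual analysis pipeline; it only shows the standard SciPy calls one might use.
```python
# Hypothetical illustration: checking whether an automatic metric agrees with
# expert ratings. All scores below are made up.
from scipy.stats import spearmanr, pearsonr

# Per-dialogue scores from a hypothetical automatic metric and expert annotators.
auto_scores   = [0.61, 0.34, 0.80, 0.47, 0.72, 0.25]
expert_scores = [4.0,  2.5,  4.5,  3.0,  4.0,  2.0]   # e.g., 1-5 Likert ratings

rho, rho_p = spearmanr(auto_scores, expert_scores)
r,   r_p   = pearsonr(auto_scores, expert_scores)
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3f})")
```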
Related papers
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
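ComperDial's metric is not described in detail here; as a loose illustration of scoring the "similarity of model-generated dialogues to human conversations", the sketch below compares mean sentence embeddings with cosine similarity. The embedding library, model name, and example dialogues are placeholder assumptions, not part of ComperDial.
```python
# Rough sketch of one way to score similarity to human conversations:
# embed dialogues and compare mean embeddings with cosine similarity.
# This is NOT ComperDial's metric; model name and data are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed to be available

human_dialogues = ["hi there! how was your weekend?", "pretty good, went hiking."]
model_dialogues = ["hello. weekend status acceptable.", "i do not hike."]

h = model.encode(human_dialogues).mean(axis=0)
m = model.encode(model_dialogues).mean(axis=0)
cosine = float(np.dot(h, m) / (np.linalg.norm(h) * np.linalg.norm(m)))
print(f"dialogue-level similarity (cosine): {cosine:.3f}")
```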
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and compared with existing approaches.
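The behavior rates such a method estimates boil down to proportions over annotated turns. Below is a toy sketch of that kind of estimate with a simple normal-approximation confidence interval; it is not the paper's annotation protocol, and the labels are made up.
```python
# Sketch of estimating a behavior rate from annotated turns (toy data):
# the proportion of turns flagged for a behavior, with a 95% CI.
import math

annotations = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # 1 = turn exhibits the behavior
n = len(annotations)
rate = sum(annotations) / n
half_width = 1.96 * math.sqrt(rate * (1 - rate) / n)
print(f"behavior rate: {rate:.2f} +/- {half_width:.2f} (95% CI, normal approx.)")
```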
arXiv Detail & Related papers (2022-12-18T22:07:55Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics, each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
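For concreteness, combining per-dimension sub-metric scores and reporting a relative improvement can be as simple as the toy arithmetic below; the dimension names and numbers are illustrative assumptions, not FineD-Eval's actual combination rule or results.
```python
# Toy sketch: average three sub-metric scores into one dialogue-level score,
# then compute a relative improvement in human correlation over a baseline.
coherence, likability, topic_depth = 0.71, 0.64, 0.58   # made-up sub-metric scores
combined = (coherence + likability + topic_depth) / 3

baseline_corr, new_corr = 0.40, 0.464                   # made-up correlations with humans
relative_improvement = (new_corr - baseline_corr) / baseline_corr
print(f"combined score: {combined:.3f}")
print(f"relative improvement: {relative_improvement:.1%}")   # ~16%
```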
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- How to Evaluate Your Dialogue Models: A Review of Approaches [2.7834038784275403]
We are the first to divide the evaluation methods into three classes, i.e., automatic evaluation, human-involved evaluation, and user simulator-based evaluation.
The existence of benchmarks suitable for the evaluation of dialogue techniques is also discussed in detail.
arXiv Detail & Related papers (2021-08-03T08:52:33Z)
- A Comprehensive Assessment of Dialog Evaluation Metrics [9.34612743192798]
Standard language evaluation metrics are ineffective for evaluating dialog.
Recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements.
This paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets.
arXiv Detail & Related papers (2021-06-07T15:17:03Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It is capable of not only performing turn-level evaluation but also holistically considering the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
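As a point of reference for what a distribution-wise distance over dialogue embeddings can look like, here is a minimal sketch of a Fréchet-style distance between Gaussian fits of two embedding sets. The embeddings are random placeholders, and this is not the paper's exact metric or implementation.
```python
# Sketch of a distribution-wise distance between two sets of dialogue
# embeddings: the Frechet distance between Gaussian fits (FID-style).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """x, y: (n_samples, dim) arrays of embeddings."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(0)
system_embs = rng.normal(0.0, 1.0, size=(200, 16))   # embeddings of generated dialogues
human_embs  = rng.normal(0.2, 1.1, size=(200, 16))   # embeddings of human dialogues
print(f"Frechet distance: {frechet_distance(system_embs, human_embs):.3f}")
```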
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
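For background only: the simplest off-policy estimator of a target policy's expected human score is per-dialogue importance sampling over logged data. ENIGMA's actual estimator is more involved; the sketch below only illustrates the general off-policy setting, and the helper function and numbers are hypothetical.
```python
# Simplest baseline: self-normalized importance sampling over logged dialogues
# to estimate "what human score would the target policy get?" (toy data).
def importance_sampling_estimate(episodes):
    """episodes: list of (target_prob, behavior_prob, human_score) per logged dialogue,
    where each probability is that of the full response sequence under the policy."""
    total, weight_sum = 0.0, 0.0
    for p_target, p_behavior, score in episodes:
        w = p_target / p_behavior
        total += w * score
        weight_sum += w
    return total / weight_sum  # self-normalized estimate

logged = [(0.02, 0.05, 4.0), (0.04, 0.03, 3.0), (0.01, 0.04, 5.0)]  # toy numbers
print(f"estimated human score under target policy: {importance_sampling_estimate(logged):.2f}")
```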
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually by reasoning over dialogue turns with the help of the back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
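For readers unfamiliar with the metric, joint belief (goal) accuracy counts a turn as correct only when the full predicted dialogue state matches the gold state. A toy sketch with made-up MultiWOZ-style slot names follows; it is the standard metric definition, not this paper's tracking method.
```python
# Joint goal (belief) accuracy: a turn is correct only if every predicted
# slot-value pair matches the gold dialogue state exactly (toy data).
def joint_goal_accuracy(predicted_states, gold_states):
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)

gold = [{"hotel-area": "north", "hotel-stars": "4"},
        {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"}]
pred = [{"hotel-area": "north", "hotel-stars": "4"},
        {"hotel-area": "north", "hotel-stars": "3", "hotel-parking": "yes"}]
print(f"joint goal accuracy: {joint_goal_accuracy(pred, gold):.2f}")  # 0.50
```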
arXiv Detail & Related papers (2020-05-27T02:05:33Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis on different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.