Don't Forget Your ABC's: Evaluating the State-of-the-Art in
Chat-Oriented Dialogue Systems
- URL: http://arxiv.org/abs/2212.09180v3
- Date: Fri, 28 Jul 2023 20:55:09 GMT
- Title: Don't Forget Your ABC's: Evaluating the State-of-the-Art in
Chat-Oriented Dialogue Systems
- Authors: Sarah E. Finch, James D. Finch, and Jinho D. Choi
- Abstract summary: This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and compared with existing approaches.
- Score: 12.914512702731528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite tremendous advancements in dialogue systems, stable evaluation still
requires human judgments, which produce notoriously high-variance metrics due to
their inherent subjectivity. Moreover, methods and labels in dialogue
evaluation are not fully standardized, especially for open-domain chats, with a
lack of work to compare and assess the validity of those approaches. Inconsistent
evaluation can misrepresent a dialogue system's performance, which becomes a major
hurdle to improving it. Thus, a dimensional evaluation of
chat-oriented open-domain dialogue systems that reliably measures several
aspects of dialogue capabilities is desired. This paper presents a novel human
evaluation method to estimate the rates of many dialogue system behaviors. Our
method is used to evaluate four state-of-the-art open-domain dialogue systems
and compared with existing approaches. The analysis demonstrates that our
behavior method is more suitable than alternative Likert-style or comparative
approaches for dimensional evaluation of these systems.
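As a rough illustration of what the behavior method reports, the sketch below aggregates binary per-turn behavior annotations into per-system behavior rates with bootstrap confidence intervals. The data layout, behavior names, and interval procedure are assumptions made for illustration; they are not the paper's actual annotation tooling.
```python
# Minimal sketch: estimating behavior rates from binary per-turn annotations.
# The data layout (system -> behavior -> list of 0/1 turn labels) is assumed
# for illustration and is not the paper's actual annotation format.
import random
from statistics import mean

def behavior_rate(labels, n_boot=1000, alpha=0.05, seed=0):
    """Return the observed rate and a bootstrap (1 - alpha) confidence interval."""
    rng = random.Random(seed)
    rate = mean(labels)
    boots = sorted(
        mean(rng.choices(labels, k=len(labels))) for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return rate, (lo, hi)

annotations = {  # hypothetical example data
    "system_A": {"empathetic": [1, 0, 1, 1, 0, 1], "self_contradiction": [0, 0, 1, 0, 0, 0]},
    "system_B": {"empathetic": [0, 0, 1, 0, 1, 0], "self_contradiction": [1, 0, 1, 1, 0, 0]},
}

for system, behaviors in annotations.items():
    for behavior, labels in behaviors.items():
        rate, (lo, hi) = behavior_rate(labels)
        print(f"{system:>8} {behavior:<20} rate={rate:.2f} 95% CI=({lo:.2f}, {hi:.2f})")
```
Reporting a rate per behavior, rather than a single Likert score per dialogue, is what makes the evaluation dimensional: each behavior rate tracks one aspect of dialogue capability.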
Related papers
- PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z)
- Bipartite-play Dialogue Collection for Practical Automatic Evaluation of Dialogue Systems [17.532851422548354]
This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation.
It addresses limitations of existing dialogue collection methods.
Experimental results show that automatic evaluation using the bipartite-play method mitigates these drawbacks.
arXiv Detail & Related papers (2022-11-19T06:12:50Z)
- Dialogue Evaluation with Offline Reinforcement Learning [2.580163308334609]
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, which is infeasible at every iteration of the development phase.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
arXiv Detail & Related papers (2022-09-02T08:32:52Z)
- User Response and Sentiment Prediction for Automatic Dialogue Evaluation [69.11124655437902]
We propose to use the sentiment of the next user utterance for turn- or dialog-level evaluation.
Experiments show that our model outperforms existing automatic evaluation metrics on both written and spoken open-domain dialogue datasets.
arXiv Detail & Related papers (2021-11-16T22:19:17Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It not only performs turn-level evaluation but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics; a sketch of one such distance appears after this list.
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a handful of pre-collected experience data and therefore does not involve human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols [17.14709845342071]
Various evaluation protocols are currently used to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems.
arXiv Detail & Related papers (2020-06-10T23:29:05Z)
- Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually, reasoning over dialogue turns with the help of back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis on different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)
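For the distribution-distance entry above (Assessing Dialogue Systems with Distribution Distances), the following sketch shows one plausible metric of that kind: a Fréchet-style distance between Gaussians fit to embeddings of a system's generated conversations and of real-world conversations. The embedding step and the Fréchet formulation are assumptions chosen for illustration, not necessarily the exact metric defined in that paper.
```python
# Sketch of a distribution-wise distance between generated and real conversations:
# fit a Gaussian to each set of conversation embeddings and compute the Frechet
# distance between the two Gaussians. The embedding step is assumed (any sentence
# encoder could stand in); this is an illustration, not the paper's exact metric.
import numpy as np
from scipy import linalg

def frechet_distance(x, y):
    """Frechet distance between Gaussians fit to rows of x and y (n_samples, dim)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_x @ cov_y, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

# Hypothetical usage: rows are embeddings of whole conversations.
rng = np.random.default_rng(0)
real_embeddings = rng.normal(size=(200, 32))
generated_embeddings = rng.normal(loc=0.3, size=(200, 32))
print(f"Frechet distance: {frechet_distance(real_embeddings, generated_embeddings):.3f}")
```
A smaller distance would indicate that the system's conversations are distributed more like real ones; how well such a distance correlates with human judgments would still need to be checked on each corpus.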
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.