Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores
from Turn-level Scores
- URL: http://arxiv.org/abs/2310.00410v1
- Date: Sat, 30 Sep 2023 15:14:50 GMT
- Title: Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores
from Turn-level Scores
- Authors: Rikiya Takehi, Akihisa Watanabe and Tetsuya Sakai
- Abstract summary: We propose an evaluation approach where a turn is decomposed into nuggets (i.e., expressions associated with a dialogue act).
We demonstrate the potential effectiveness of our evaluation method through a case study.
- Score: 17.791039417061565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing dialogue quality evaluation systems can return a score for a given
system turn from a particular viewpoint, e.g., engagingness. However, to
improve dialogue systems by locating exactly where in a system turn potential
problems lie, a more fine-grained evaluation may be necessary. We therefore
propose an evaluation approach where a turn is decomposed into nuggets (i.e.,
expressions associated with a dialogue act), and nugget-level evaluation is
enabled by leveraging an existing turn-level evaluation system. We demonstrate
the potential effectiveness of our evaluation method through a case study.
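The abstract does not spell out how nugget-level scores are derived from the turn-level system, so the following is only an illustrative sketch, not the authors' formulation: it treats an existing turn-level evaluator as a black box and assigns each nugget a leave-one-out score, i.e., the drop in the turn-level score when that nugget is removed from the turn. The names `nugget_scores`, `turn_level_score`, and the toy scorer are hypothetical.

```python
from typing import Callable, List


def nugget_scores(
    context: str,
    nuggets: List[str],
    turn_level_score: Callable[[str, str], float],
) -> List[float]:
    """Leave-one-out attribution: each nugget's score is the drop in the
    turn-level score observed when that nugget is removed from the turn."""
    full_turn = " ".join(nuggets)
    full_score = turn_level_score(context, full_turn)
    scores = []
    for i in range(len(nuggets)):
        reduced_turn = " ".join(nuggets[:i] + nuggets[i + 1:])
        scores.append(full_score - turn_level_score(context, reduced_turn))
    return scores


if __name__ == "__main__":
    # Toy turn-level scorer (placeholder only): rewards on-topic, longer turns.
    def toy_scorer(context: str, turn: str) -> float:
        return 0.1 * len(turn.split()) + (1.0 if "sunny" in turn else 0.0)

    ctx = "User: What's the weather like today?"
    ngs = ["It's sunny and warm.", "By the way, I like pizza."]
    print(nugget_scores(ctx, ngs, toy_scorer))
```

Any turn-level evaluation system with the same (context, turn) -> score interface could be plugged in place of the toy scorer.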
Related papers
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255] (arXiv, 2023-06-27)
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
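The summary does not give the paper's exact formulation, but conditional pointwise mutual information has a standard textbook definition; for a system response r and a user utterance u given the dialogue context c it reads as below (the variables C-PMI actually conditions on may differ).

```latex
\mathrm{CPMI}(r; u \mid c)
  = \log \frac{P(r, u \mid c)}{P(r \mid c)\,P(u \mid c)}
  = \log \frac{P(r \mid u, c)}{P(r \mid c)}
```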
- Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs [19.43845920149182] (arXiv, 2023-06-06)
We introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA).
DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment.
We argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.
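A dialog-level annotation record of the kind described above might look like the hypothetical schema below, using only the attributes named in the summary (the actual DQA label set and scales are not specified here).

```python
from dataclasses import dataclass


@dataclass
class DialogAnnotation:
    """One dialog-level record as an expert annotator might produce it."""
    dialog_id: str
    overall_quality: int   # hypothetical 1-5 holistic rating of the whole dialog
    goal_completion: bool  # did the user accomplish their goal?
    user_sentiment: str    # e.g., "positive" / "neutral" / "negative"


# Example record for a single annotated dialog.
example = DialogAnnotation(
    dialog_id="dlg-0001",
    overall_quality=4,
    goal_completion=True,
    user_sentiment="positive",
)
```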
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486] (arXiv, 2023-03-31)
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z) - Don't Forget Your ABC's: Evaluating the State-of-the-Art in
Chat-Oriented Dialogue Systems [12.914512702731528]
This paper presents a novel human evaluation method to estimate the rates of many dialogue system behaviors.
Our method is used to evaluate four state-of-the-art open-domain dialogue systems and is compared with existing approaches.
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006] (arXiv, 2022-02-14)
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it.
To utilize segment act flows, sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
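As a minimal illustration of the data structure described above, a segment act flow can be represented as a sequence of (segment, act) pairs; the act labels below are hypothetical placeholders, and FlowEval's actual label inventory and consensus procedure are not shown.

```python
from typing import List, Tuple

# A segment act flow: the sequence of (segment, act) pairs for one dialogue turn.
SegmentActFlow = List[Tuple[str, str]]

turn_flow: SegmentActFlow = [
    ("It's sunny and warm today.", "inform"),          # hypothetical act label
    ("Do you have any plans to go outside?", "question"),
]

# The flow used for comparison is the sequence of acts.
act_sequence = [act for _, act in turn_flow]
print(act_sequence)  # ['inform', 'question']
```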
- How to Evaluate Your Dialogue Models: A Review of Approaches [2.7834038784275403] (arXiv, 2021-08-03)
We are the first to divide the evaluation methods into three classes, i.e., automatic evaluation, human-involved evaluation, and user simulator-based evaluation.
The existence of benchmarks suitable for the evaluation of dialogue techniques is also discussed in detail.
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898] (arXiv, 2021-06-02)
We propose DynaEval, a unified automatic evaluation framework.
It is not only capable of performing turn-level evaluation but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962] (arXiv, 2021-05-06)
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
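The summary does not say which distribution-wise distance is used; one common choice for comparing two sets of embeddings is the Fréchet distance between Gaussian fits (as popularized by FID). The sketch below assumes that choice and assumes conversations have already been encoded into fixed-size vectors (e.g., by a sentence encoder); it is an illustration, not the paper's exact metric.

```python
import numpy as np
from scipy import linalg


def frechet_distance(real_embs: np.ndarray, gen_embs: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two embedding sets,
    each of shape (num_conversations, embedding_dim)."""
    mu_r, mu_g = real_embs.mean(axis=0), gen_embs.mean(axis=0)
    cov_r = np.cov(real_embs, rowvar=False)
    cov_g = np.cov(gen_embs, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


# Toy usage with random vectors standing in for conversation embeddings.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 16))
fake = rng.normal(0.5, 1.2, size=(200, 16))
print(frechet_distance(real, fake))
```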
- Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols [17.14709845342071] (arXiv, 2020-06-10)
Various evaluation protocols are currently in use to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems.
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218] (arXiv, 2020-05-15)
This paper presents an empirical analysis on different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.