What Went Wrong? Explaining Overall Dialogue Quality through
Utterance-Level Impacts
- URL: http://arxiv.org/abs/2111.00572v1
- Date: Sun, 31 Oct 2021 19:12:29 GMT
- Title: What Went Wrong? Explaining Overall Dialogue Quality through
Utterance-Level Impacts
- Authors: James D. Finch, Sarah E. Finch, Jinho D. Choi
- Abstract summary: This paper presents a novel approach to automated analysis of conversation logs that learns the relationship between user-system interactions and overall dialogue quality.
Our approach learns the impact of each interaction from the overall user rating without utterance-level annotation.
Experiments show that the automated analysis from our model agrees with expert judgments, making this work the first to show that such weakly-supervised learning of utterance-level quality prediction is highly achievable.
- Score: 15.018259942339448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving user experience of a dialogue system often requires intensive
developer effort to read conversation logs, run statistical analyses, and
intuit the relative importance of system shortcomings. This paper presents a
novel approach to automated analysis of conversation logs that learns the
relationship between user-system interactions and overall dialogue quality.
Unlike prior work on utterance-level quality prediction, our approach learns
the impact of each interaction from the overall user rating without
utterance-level annotation, allowing resultant model conclusions to be derived
on the basis of empirical evidence and at low cost. Our model identifies
interactions that have a strong correlation with the overall dialogue quality
in a chatbot setting. Experiments show that the automated analysis from our
model agrees with expert judgments, making this work the first to show that
such weakly-supervised learning of utterance-level quality prediction is highly
achievable.
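The abstract gives no implementation details, so purely as an illustration, the sketch below shows one way such weakly-supervised utterance-level impact learning could be wired up: score every utterance with a small network, aggregate the scores into a predicted dialogue rating, and train against only the dialogue-level user rating so the per-utterance scores emerge as learned impact estimates. This is a minimal sketch under assumed conventions, not the authors' method; all identifiers (UtteranceImpactModel, embed_dim, the mean-aggregation choice) are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of weakly-supervised
# utterance-level impact learning: per-utterance scores are aggregated into a
# dialogue rating, and training uses only dialogue-level ratings as supervision.
import torch
import torch.nn as nn


class UtteranceImpactModel(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Maps each utterance representation to a scalar "impact" score.
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, utterance_embeds: torch.Tensor, mask: torch.Tensor):
        """
        utterance_embeds: (batch, max_turns, embed_dim) pre-computed embeddings
        mask:             (batch, max_turns), 1 for real turns, 0 for padding
        Returns per-utterance impact scores and a predicted dialogue rating.
        """
        scores = self.scorer(utterance_embeds).squeeze(-1)  # (batch, max_turns)
        scores = scores * mask                               # ignore padded turns
        # Dialogue rating as the mean impact over real turns (one simple choice).
        rating_pred = scores.sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return scores, rating_pred


if __name__ == "__main__":
    # Toy example: 2 dialogues, up to 4 turns, random embeddings, ratings only.
    torch.manual_seed(0)
    model = UtteranceImpactModel(embed_dim=16)
    embeds = torch.randn(2, 4, 16)
    mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]], dtype=torch.float)
    ratings = torch.tensor([3.0, 5.0])  # dialogue-level user ratings only

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(100):
        scores, rating_pred = model(embeds, mask)
        loss = nn.functional.mse_loss(rating_pred, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # After training, the per-utterance scores serve as learned impact estimates.
    scores, _ = model(embeds, mask)
    print(scores)
```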
Related papers
- Toward More Accurate and Generalizable Evaluation Metrics for
Task-Oriented Dialogs [19.43845920149182]
We introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA).
DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment.
We argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.
arXiv Detail & Related papers (2023-06-06T19:43:29Z) - Analyzing and Evaluating Faithfulness in Dialogue Summarization [67.07947198421421]
We first perform a fine-grained human analysis of the faithfulness of dialogue summaries and observe that over 35% of generated summaries are unfaithful to the source dialogues.
We present a new model-level faithfulness evaluation method that examines generation models with multiple-choice questions created by rule-based transformations.
arXiv Detail & Related papers (2022-10-21T07:22:43Z) - DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It not only performs turn-level evaluation but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a small amount of pre-collected experience data and therefore does not involve human interaction with the target policy during evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - I like fish, especially dolphins: Addressing Contradictions in Dialogue
Modeling [104.09033240889106]
We introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues.
We then compare a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach.
arXiv Detail & Related papers (2020-12-24T18:47:49Z) - Turn-level Dialog Evaluation with Dialog-level Weak Signals for
Bot-Human Hybrid Customer Service Systems [0.0]
We developed a machine learning approach that quantifies multiple aspects of the success or value of customer service contacts at any time during the interaction.
We show how it improves Amazon customer service quality in several applications.
arXiv Detail & Related papers (2020-10-25T19:36:23Z) - Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z) - You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
Research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)