DynaEval: Unifying Turn and Dialogue Level Evaluation
- URL: http://arxiv.org/abs/2106.01112v2
- Date: Thu, 3 Jun 2021 07:21:35 GMT
- Title: DynaEval: Unifying Turn and Dialogue Level Evaluation
- Authors: Chen Zhang, Yiming Chen, Luis Fernando D'Haro, Yan Zhang, Thomas
Friedrichs, Grandee Lee, Haizhou Li
- Abstract summary: We propose DynaEval, a unified automatic evaluation framework.
It is not only capable of performing turn-level evaluation, but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
- Score: 60.66883575106898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A dialogue is essentially a multi-turn interaction among interlocutors.
Effective evaluation metrics should reflect the dynamics of such interaction.
Existing automatic metrics focus heavily on turn-level quality,
while ignoring such dynamics. To this end, we propose DynaEval, a unified
automatic evaluation framework which is not only capable of performing
turn-level evaluation, but also holistically considers the quality of the
entire dialogue. In DynaEval, the graph convolutional network (GCN) is adopted
to model a dialogue in totality, where the graph nodes denote each individual
utterance and the edges represent the dependency between pairs of utterances. A
contrastive loss is then applied to distinguish well-formed dialogues from
carefully constructed negative samples. Experiments show that DynaEval
significantly outperforms the state-of-the-art dialogue coherence model, and
correlates strongly with human judgements across multiple dialogue evaluation
aspects at both turn and dialogue level.
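Below is a minimal, self-contained sketch (in PyTorch) of the idea described in the abstract: utterances become graph nodes, a graph convolution propagates information along utterance-dependency edges, and a margin-based contrastive loss separates a well-formed dialogue from a negative sample. This is not the authors' released code; the encoder dimensions, edge construction, pooling, loss margin, and the random-shuffle negative are illustrative assumptions.

```python
# Hedged sketch of a DynaEval-style dialogue scorer: GCN over utterance nodes
# plus a contrastive margin loss. All hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DialogueGraphScorer(nn.Module):
    def __init__(self, utt_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.gcn_weight = nn.Linear(utt_dim, hidden_dim)   # one graph-conv layer
        self.scorer = nn.Linear(hidden_dim, 1)              # dialogue-level score

    def forward(self, utt_embs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # utt_embs: (num_utts, utt_dim) sentence embeddings, one per utterance
        # adj:      (num_utts, num_utts) dependency edges between utterance pairs
        adj = adj + torch.eye(adj.size(0))                  # add self-loops
        deg_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm_adj = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]
        h = F.relu(self.gcn_weight(norm_adj @ utt_embs))    # propagate over the graph
        dialogue_repr = h.mean(dim=0)                       # pool nodes -> dialogue
        return self.scorer(dialogue_repr).squeeze(-1)


def contrastive_loss(pos_score, neg_score, margin: float = 1.0):
    # Push the well-formed dialogue's score above the negative sample's by a margin.
    return F.relu(margin - pos_score + neg_score)


# Toy usage: a 4-utterance dialogue versus a negative with shuffled utterance order.
model = DialogueGraphScorer()
embs = torch.randn(4, 768)      # stand-in for a real utterance encoder's output
adj = torch.ones(4, 4)          # placeholder fully connected utterance graph
loss = contrastive_loss(model(embs, adj), model(embs[torch.randperm(4)], adj))
loss.backward()
```

In the paper the negative samples are carefully constructed rather than produced by a simple utterance shuffle; the shuffle above is only a stand-in to make the contrastive objective concrete.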
Related papers
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems [57.16442740983528]
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems.
Previous studies suggest using only a portion of the dialogue context in the annotation process.
This study investigates the influence of dialogue context on annotation quality.
arXiv Detail & Related papers (2024-04-15T17:56:39Z)
- Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs [19.43845920149182]
We introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA).
DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment.
We argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.
arXiv Detail & Related papers (2023-06-06T19:43:29Z)
- What Went Wrong? Explaining Overall Dialogue Quality through Utterance-Level Impacts [15.018259942339448]
This paper presents a novel approach to automated analysis of conversation logs that learns the relationship between user-system interactions and overall dialogue quality.
Our approach learns the impact of each interaction from the overall user rating without utterance-level annotation.
Experiments show that the automated analysis from our model agrees with expert judgments, making this work the first to show that such weakly-supervised learning of utterance-level quality prediction is highly achievable.
arXiv Detail & Related papers (2021-10-31T19:12:29Z)
- WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue [17.663449579168297]
We simulate a dialogue between an agent and a user (modelled similarly to the agent, with a supervised learning objective), letting the two interact with each other.
The agent uses dynamic blocking to generate ranked diverse responses and exploration-exploitation to select among the Top-K responses.
Empirical studies on two benchmarks indicate that our model significantly improves response quality and leads to successful conversations.
arXiv Detail & Related papers (2021-08-01T08:00:45Z)
- GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems [133.13117064357425]
We propose a new evaluation metric GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation.
Specifically, GRADE incorporates both coarse-grained utterance-level contextualized representations and fine-grained topic-level graph representations to evaluate dialogue coherence.
Experimental results show that our GRADE significantly outperforms other state-of-the-art metrics on measuring diverse dialogue models (a minimal illustrative sketch of this two-granularity idea appears after this list).
arXiv Detail & Related papers (2020-10-08T14:07:32Z)
- Is this Dialogue Coherent? Learning from Dialogue Acts and Entities [82.44143808977209]
We create the Switchboard Coherence (SWBD-Coh) corpus, a dataset of human-human spoken dialogues annotated with turn coherence ratings.
Our statistical analysis of the corpus indicates how turn coherence perception is affected by patterns of distribution of entities.
We find that models combining both DA and entity information yield the best performances both for response selection and turn coherence rating.
arXiv Detail & Related papers (2020-06-17T21:02:40Z)
- Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation [114.48767388174218]
This paper presents an empirical analysis on different types of dialog systems composed of different modules in different settings.
Our results show that a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels.
arXiv Detail & Related papers (2020-05-15T05:20:06Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
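Referring back to the GRADE entry above, the following rough sketch shows one way a coherence metric can fuse a coarse utterance-level contextualized representation with a fine-grained topic-level graph representation before scoring. This is not the released GRADE code; the dimensions, the toy topic-graph layer, and the random tensors in the usage line are all assumptions made for illustration.

```python
# Hedged sketch of a two-granularity coherence metric: concatenate a coarse
# context+response embedding with a pooled topic-graph representation, then score.
import torch
import torch.nn as nn


class TwoGranularityCoherence(nn.Module):
    def __init__(self, utt_dim: int = 768, topic_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.topic_gcn = nn.Linear(topic_dim, topic_dim)      # toy topic-graph layer
        self.mlp = nn.Sequential(
            nn.Linear(utt_dim + topic_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                                      # coherence score in [0, 1]
        )

    def forward(self, ctx_resp_emb, topic_embs, topic_adj):
        # ctx_resp_emb: (utt_dim,) coarse context+response embedding
        # topic_embs:   (num_topics, topic_dim) keyword-node embeddings
        # topic_adj:    (num_topics, num_topics) topic graph adjacency
        topic_repr = torch.relu(self.topic_gcn(topic_adj @ topic_embs)).mean(dim=0)
        return self.mlp(torch.cat([ctx_resp_emb, topic_repr], dim=-1)).squeeze(-1)


# Toy usage with random stand-ins for a real encoder and keyword-graph builder.
metric = TwoGranularityCoherence()
score = metric(torch.randn(768), torch.randn(5, 128), torch.ones(5, 5))
```

In GRADE itself the topic-level graph is constructed from dialogue keywords; the placeholder adjacency matrix and random embeddings above only stand in for that construction.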