GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating
Open-Domain Dialogue Systems
- URL: http://arxiv.org/abs/2010.03994v1
- Date: Thu, 8 Oct 2020 14:07:32 GMT
- Title: GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating
Open-Domain Dialogue Systems
- Authors: Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, Xiaodan Liang
- Abstract summary: We propose a new evaluation metric GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation.
Specifically, GRADE incorporates both coarse-grained utterance-level contextualized representations and fine-grained topic-level graph representations to evaluate dialogue coherence.
Experimental results show that our GRADE significantly outperforms other state-of-the-art metrics on measuring diverse dialogue models.
- Score: 133.13117064357425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically evaluating dialogue coherence is a challenging yet
essential capability for developing high-quality open-domain dialogue systems.
However, current evaluation metrics consider only surface features or
utterance-level semantics, without explicitly considering the fine-grained
topic transition dynamics of dialogue flows. We posit that the graph structure
formed by the topics in a dialogue can accurately capture the underlying
communication logic, offering a more natural basis for a convincing metric.
Capitalizing on the topic-level dialogue graph, we propose a new evaluation
metric GRADE, which stands for Graph-enhanced Representations for Automatic
Dialogue Evaluation. Specifically, GRADE incorporates both coarse-grained
utterance-level contextualized representations and fine-grained topic-level
graph representations to evaluate dialogue coherence. The graph representations
are obtained by reasoning over topic-level dialogue graphs enhanced with the
evidence from a commonsense graph, including k-hop neighboring representations
and hop-attention weights. Experimental results show that our GRADE
significantly outperforms other state-of-the-art metrics on measuring diverse
dialogue models in terms of the Pearson and Spearman correlations with human
judgements. In addition, we release a new large-scale human evaluation benchmark to
facilitate future research on automatic metrics.
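The two-level combination described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: GRADE uses BERT-based contextualized encoders and graph attention over a commonsense graph (ConceptNet), whereas here a bag-of-words cosine stands in for the utterance-level encoder, an exponentially decaying per-hop weight stands in for hop-attention, and `alpha`, the topic graph, and all function names are illustrative assumptions:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors (Counters)."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def k_hop_representation(graph, start_topics, k, decay=0.5):
    """Aggregate topic nodes reachable within k hops of the start topics.
    Hop-attention is mimicked by a fixed exponentially decaying weight per hop."""
    weights = Counter()
    frontier = set(start_topics)
    seen = set(frontier)
    for hop in range(k + 1):
        w = decay ** hop
        for topic in frontier:
            weights[topic] += w
        frontier = {n for t in frontier for n in graph.get(t, []) if n not in seen}
        seen |= frontier
    return weights

def grade_style_score(context, response, topic_graph,
                      context_topics, response_topics, k=2, alpha=0.5):
    """Blend a coarse utterance-level similarity with a fine-grained
    topic-graph similarity, in the spirit of the metric described above."""
    utt_sim = cosine(Counter(context.split()), Counter(response.split()))
    g_ctx = k_hop_representation(topic_graph, context_topics, k)
    g_resp = k_hop_representation(topic_graph, response_topics, k)
    return alpha * utt_sim + (1 - alpha) * cosine(g_ctx, g_resp)
```

With a tiny hand-built topic graph, a response whose topics lie near the context's topics scores higher than one whose topics are disconnected, which is the intuition behind reasoning over the topic-level dialogue graph.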
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems [57.16442740983528]
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems.
Previous studies suggest using only a portion of the dialogue context in the annotation process.
This study investigates the influence of dialogue context on annotation quality.
arXiv Detail & Related papers (2024-04-15T17:56:39Z)
- A Graph-to-Text Approach to Knowledge-Grounded Response Generation in Human-Robot Interaction [2.3590037806133024]
This paper presents a novel conversational model for human-robot interaction that rests upon a graph-based representation of the dialogue state.
The neural conversational model employed to respond to user utterances relies on a simple but effective graph-to-text mechanism.
The proposed approach is empirically evaluated through a user study with a humanoid robot.
arXiv Detail & Related papers (2023-11-03T15:44:28Z)
- GraphWOZ: Dialogue Management with Conversational Knowledge Graphs [2.938377447673471]
We present a new approach to dialogue management using conversational knowledge graphs as core representation of the dialogue state.
We introduce a new dataset, GraphWOZ, which comprises Wizard-of-Oz dialogues in which human participants interact with a robot acting as a receptionist.
arXiv Detail & Related papers (2022-11-23T10:53:21Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It is capable not only of performing turn-level evaluation but also of holistically considering the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- Discovering Dialog Structure Graph for Open-Domain Dialog Generation [51.29286279366361]
We conduct unsupervised discovery of dialog structure from chitchat corpora.
We then leverage it to facilitate dialog generation in downstream systems.
We present a Discrete Variational Auto-Encoder with Graph Neural Network (DVAE-GNN), to discover a unified human-readable dialog structure.
arXiv Detail & Related papers (2020-12-31T10:58:37Z)
- Dialogue Relation Extraction with Document-level Heterogeneous Graph Attention Networks [21.409522845011907]
Dialogue relation extraction (DRE) aims to detect the relation between two entities mentioned in a multi-party dialogue.
We present a graph attention network-based method for DRE where a graph contains meaningfully connected speaker, entity, entity-type, and utterance nodes.
We empirically show that this graph-based approach effectively captures the relations between different entity pairs in a dialogue, outperforming state-of-the-art approaches.
arXiv Detail & Related papers (2020-09-10T18:51:48Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.