DEAM: Dialogue Coherence Evaluation using AMR-based Semantic
Manipulations
- URL: http://arxiv.org/abs/2203.09711v1
- Date: Fri, 18 Mar 2022 03:11:35 GMT
- Title: DEAM: Dialogue Coherence Evaluation using AMR-based Semantic
Manipulations
- Authors: Sarik Ghazarian, Nuan Wen, Aram Galstyan, Nanyun Peng
- Abstract summary: We propose a Dialogue Evaluation metric that relies on AMR-based semantic manipulations for incoherent data generation.
Our experiments show that DEAM achieves higher correlations with human judgments compared to baseline methods.
- Score: 46.942369532632604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic evaluation metrics are essential for the rapid development of
open-domain dialogue systems as they facilitate hyper-parameter tuning and
comparison between models. Although recently proposed trainable
conversation-level metrics have shown encouraging results, the quality of the
metrics is strongly dependent on the quality of training data. Prior works
mainly resort to heuristic text-level manipulations (e.g., utterance shuffling)
to bootstrap incoherent conversations (negative examples) from coherent
dialogues (positive examples). Such approaches are insufficient to
appropriately reflect the incoherence that occurs in interactions between
advanced dialogue models and humans. To tackle this problem, we propose DEAM, a
Dialogue coherence Evaluation metric that relies on Abstract Meaning
Representation (AMR) to apply semantic-level Manipulations for incoherent
(negative) data generation. AMRs naturally facilitate the injection of various
types of incoherence sources, such as coreference inconsistency, irrelevancy,
contradictions, and decreased engagement, at the semantic level, thus resulting
in more natural incoherent samples. Our experiments show that DEAM achieves
higher correlations with human judgments compared to baseline methods on
several dialogue datasets by significant margins. We also show that DEAM can
distinguish between coherent and incoherent dialogues generated by baseline
manipulations, whereas metrics trained on those baseline manipulations cannot detect incoherent examples
generated by DEAM. Our results demonstrate the potential of AMR-based semantic
manipulations for natural negative example generation.
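As a rough illustration of the semantic-level manipulations described above, the sketch below uses the `penman` library to inject a coreference inconsistency into a toy AMR graph by re-pointing a coreferent :ARG0 edge to a newly introduced entity. This is a minimal sketch under stated assumptions, not the paper's implementation; the example sentence, graph, and rewrite rule are illustrative.

```python
# Minimal sketch (not DEAM's actual code): inject a coreference inconsistency
# into an AMR graph, assuming the `penman` library is installed.
import penman

# Hypothetical AMR for "She said she likes the movie."
amr = """
(s / say-01
   :ARG0 (p / person)
   :ARG1 (l / like-01
            :ARG0 p
            :ARG1 (m / movie)))
"""

graph = penman.decode(amr)

# Coreference-breaking manipulation: find an edge that reuses the variable `p`
# (the second mention of the speaker) and re-point it to a new, unrelated entity.
new_triples = []
swapped = False
for source, role, target in graph.triples:
    if not swapped and role == ":ARG0" and target == "p" and source != "s":
        new_triples.append((source, role, "p2"))            # break the coreference link
        new_triples.append(("p2", ":instance", "person"))   # introduce a new referent
        swapped = True
    else:
        new_triples.append((source, role, target))

perturbed = penman.Graph(new_triples, top=graph.top)
print(penman.encode(perturbed))
```

The manipulated graph would then be verbalized back into text (e.g., with an AMR-to-text generator) so that the resulting negative sample reads fluently while carrying the injected incoherence.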
Related papers
- Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation [26.330012489735456]
This paper proposes an effective framework for open-domain dialogue evaluation.
It combines domain-specific language models (SLMs), enhanced with Abstract Meaning Representation (AMR) knowledge, with Large Language Models (LLMs).
Experimental results on open-domain dialogue evaluation tasks demonstrate the superiority of our method compared to a wide range of state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-01T14:11:45Z)
- PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals that encourage the model to generate content from the reference summaries that has not yet been covered.
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2023-01-25T09:33:11Z)
- Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation [34.52276336319678]
Open-domain neural dialogue models have achieved high performance in response ranking and evaluation tasks.
However, this performance is achieved by over-relying on content similarity, which makes the models less sensitive to the presence of inconsistencies.
We propose approaches for automatically creating adversarial negative training data.
arXiv Detail & Related papers (2021-06-10T16:20:55Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It not only performs turn-level evaluation but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling [104.09033240889106]
We introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues.
We then compare a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach.
arXiv Detail & Related papers (2020-12-24T18:47:49Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
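Both DEAM and several of the related metrics above are validated by correlating automatic scores with human judgments. The sketch below shows, under illustrative assumptions (the checkpoint name and the toy dialogues and ratings are hypothetical), how a fine-tuned conversation-level coherence model could score dialogues and how those scores would be correlated with human ratings via Spearman's rho.

```python
# Minimal sketch: score dialogues with a (hypothetical) fine-tuned coherence
# model and correlate the scores with human judgments.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-finetuned-coherence-roberta"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def coherence_score(turns):
    """Join the dialogue turns and return a scalar coherence score."""
    text = tokenizer.sep_token.join(turns)
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

# Toy evaluation data: dialogues paired with human coherence ratings.
dialogues = [
    ["Hi, how was your trip?", "Great, the weather in Rome was lovely."],
    ["Do you like jazz?", "I do! Coltrane is my favorite.", "Mine too."],
    ["Can you recommend a book?", "My car broke down yesterday."],
]
human_ratings = [4.5, 5.0, 1.5]

metric_scores = [coherence_score(d) for d in dialogues]
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman correlation with human judgments: {rho:.3f} (p = {p_value:.3f})")
```

A higher correlation indicates that the learned metric ranks dialogues more like human annotators do, which is the criterion used to compare DEAM against baseline manipulation strategies.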