$Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues
via Question Generation and Question Answering
- URL: http://arxiv.org/abs/2104.08202v1
- Date: Fri, 16 Apr 2021 16:21:16 GMT
- Title: $Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues
via Question Generation and Question Answering
- Authors: Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor,
Omri Abend
- Abstract summary: We propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models.
Our metric makes use of co-reference resolution and natural language inference capabilities.
We curate a novel dataset of state-of-the-art dialogue system outputs for the Wizard-of-Wikipedia dataset.
- Score: 38.951535576102906
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Neural knowledge-grounded generative models for dialogue often produce
content that is factually inconsistent with the source text they rely on. As a
consequence, such models are unreliable, limiting their real-world
applicability. Inspired by recent work on evaluating factual consistency in
abstractive summarization (Durmus et al., 2020; Wang et al., 2020), we propose
an automatic evaluation metric for factual consistency in knowledge-grounded
dialogue models using automatic question generation and question answering.
Unlike previous works which use naive token-based comparison of answer spans,
our metric makes use of co-reference resolution and natural language inference
capabilities which greatly improve its performance. To foster proper
evaluation, we curate a novel dataset of state-of-the-art dialogue system
outputs for the Wizard-of-Wikipedia dataset (Dinan et al., 2019), which we
manually annotate for factual consistency. We perform a thorough
meta-evaluation of our metric against other metrics using the new dataset and
two others, where it greatly outperforms the baselines.
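The following minimal sketch illustrates the general QG-QA recipe described above; it is not the authors' released implementation. Informative spans in the system response are turned into questions, each question is answered against the grounding knowledge, and the two answer spans are compared by exact match with an NLI fallback. The model checkpoints, the span-highlighting prompt format, and the premise/hypothesis construction are illustrative assumptions.

```python
# Illustrative sketch of a QG-QA factual-consistency check in the spirit of Q^2.
# NOT the authors' implementation; the checkpoints below are placeholder
# assumptions and the prompt format follows the assumed QG model's card.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")     # question generation (assumed)
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")  # extractive QA (assumed)
nli = pipeline("text-classification", model="roberta-large-mnli")         # NLI answer comparison (assumed)


def q2_sketch(response: str, knowledge: str, answer_spans: list[str]) -> float:
    """Score how well `response` is supported by `knowledge`.

    `answer_spans` are informative spans extracted from the response (the
    paper uses named entities and noun phrases). For each span, generate a
    question answered by that span, answer it against the knowledge, and
    check agreement via exact match or NLI entailment.
    """
    scores = []
    for span in answer_spans:
        # Highlight the span so the QG model asks a question about it.
        prompt = "generate question: " + response.replace(span, f"<hl> {span} <hl>")
        question = qg(prompt, max_length=64)[0]["generated_text"]

        # Answer the generated question using the grounding knowledge.
        knowledge_answer = qa(question=question, context=knowledge)["answer"]

        if span.strip().lower() == knowledge_answer.strip().lower():
            scores.append(1.0)  # answers match exactly
            continue

        # Otherwise ask the NLI model whether the knowledge-based answer
        # (premise) entails the response span (hypothesis), in the context
        # of the generated question.
        result = nli({"text": f"{question} {knowledge_answer}",
                      "text_pair": f"{question} {span}"})
        verdict = result[0] if isinstance(result, list) else result
        scores.append(1.0 if verdict["label"] == "ENTAILMENT" else 0.0)

    return sum(scores) / len(scores) if scores else 0.0
```

A higher score indicates that more of the response's factual spans are supported by the knowledge. The paper's metric additionally applies co-reference resolution to the dialogue before extracting candidate spans and handles unanswerable questions explicitly; both steps are omitted here for brevity.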
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals that encourage the model to generate content from the reference summaries that has not yet been covered.
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2023-01-25T09:33:11Z)
- ED-FAITH: Evaluating Dialogue Summarization on Faithfulness [35.73012379398233]
We first present a systematic study of faithfulness metrics for dialogue summarization.
We observe that most metrics correlate poorly with human judgements despite performing well on news datasets.
We propose T0-Score -- a new metric for faithfulness evaluation.
arXiv Detail & Related papers (2022-11-15T19:33:50Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics [47.20761880464552]
Generative dialogue modeling is widely seen as a language modeling task. The task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user. The automatic metrics in use evaluate the quality of the generated text as a proxy for the agent's holistic interaction.
arXiv Detail & Related papers (2020-08-24T13:28:35Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)