FEQA: A Question Answering Evaluation Framework for Faithfulness
Assessment in Abstractive Summarization
- URL: http://arxiv.org/abs/2005.03754v1
- Date: Thu, 7 May 2020 21:00:08 GMT
- Title: FEQA: A Question Answering Evaluation Framework for Faithfulness
Assessment in Abstractive Summarization
- Authors: Esin Durmus and He He and Mona Diab
- Abstract summary: We tackle the problem of evaluating faithfulness of a generated summary given its source document.
We find that current models exhibit a trade-off between abstractiveness and faithfulness.
We propose an automatic question answering (QA) based metric for faithfulness.
- Score: 34.2456005415483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural abstractive summarization models are prone to generate content
inconsistent with the source document, i.e. unfaithful. Existing automatic
metrics do not capture such mistakes effectively. We tackle the problem of
evaluating faithfulness of a generated summary given its source document. We
first collected human annotations of faithfulness for outputs from numerous
models on two datasets. We find that current models exhibit a trade-off between
abstractiveness and faithfulness: outputs with less word overlap with the
source document are more likely to be unfaithful. Next, we propose an automatic
question answering (QA) based metric for faithfulness, FEQA, which leverages
recent advances in reading comprehension. Given question-answer pairs generated
from the summary, a QA model extracts answers from the document; non-matched
answers indicate unfaithful information in the summary. Among metrics based on
word overlap, embedding similarity, and learned language understanding models,
our QA-based metric has significantly higher correlation with human
faithfulness scores, especially on highly abstractive summaries.
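A minimal sketch of the QA-based faithfulness check described in the abstract, under stated assumptions: the Hugging Face "question-answering" pipeline stands in for the reading-comprehension model, the question-generation step is left as a stub to be filled with a learned QG model, and the token-level F1 matching with a 0.5 threshold is an illustrative choice rather than the paper's exact configuration.
```python
from collections import Counter

from transformers import pipeline

# Extractive reading-comprehension model used to answer questions against
# the source document (transformers' default SQuAD-finetuned QA model).
qa_model = pipeline("question-answering")


def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answer strings."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def generate_qa_pairs(summary: str) -> list[tuple[str, str]]:
    """Placeholder for the question-generation step: FEQA uses a learned
    question-generation model over summary spans; plug one in here."""
    raise NotImplementedError("supply a question-generation model")


def faithfulness_score(summary: str, document: str, threshold: float = 0.5) -> float:
    """Fraction of summary-derived questions whose answer, extracted from
    the source document, matches the answer expected from the summary."""
    qa_pairs = generate_qa_pairs(summary)
    if not qa_pairs:
        return 0.0
    matched = 0
    for question, summary_answer in qa_pairs:
        doc_answer = qa_model(question=question, context=document)["answer"]
        if token_f1(doc_answer, summary_answer) >= threshold:
            matched += 1
    return matched / len(qa_pairs)
```
A higher score means more of the summary's question-answer pairs are confirmed by the document; unmatched answers point to potentially unfaithful spans.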
Related papers
- Localizing Factual Inconsistencies in Attributable Text Generation [91.981439746404]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - STORYSUMM: Evaluating Faithfulness in Story Summarization [31.94902013480574]
We introduce a new dataset, STORYSUMM, comprising short stories with localized faithfulness labels and error explanations.
This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies.
arXiv Detail & Related papers (2024-07-09T02:06:30Z) - On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - MQAG: Multiple-choice Question Answering and Generation for Assessing
Information Consistency in Summarization [55.60306377044225]
State-of-the-art summarization systems can generate highly fluent summaries.
These summaries, however, may contain factual inconsistencies and/or information not present in the source.
We introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared.
arXiv Detail & Related papers (2023-01-28T23:08:25Z) - HaRiM$^+$: Evaluating Summary Quality with Hallucination Risk [0.6617666829632144]
We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods (a simplified sketch of this token-likelihood idea appears after this list).
For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval.
arXiv Detail & Related papers (2022-11-22T09:36:41Z) - Towards Improving Faithfulness in Abstractive Summarization [37.19777407790153]
We propose a Faithfulness Enhanced Summarization model (FES) to improve fidelity in abstractive summarization.
Our model outperforms strong baselines in experiments on CNN/DM and XSum.
arXiv Detail & Related papers (2022-10-04T19:52:09Z) - Extractive is not Faithful: An Investigation of Broad Unfaithfulness
Problems in Extractive Summarization [91.86501509439815]
In this work, we define a typology with five types of broad unfaithfulness problems that can appear in extractive summaries.
We ask humans to label these problems in 1600 English summaries produced by 16 diverse extractive systems.
When used to automatically detect these problems, 5 existing faithfulness evaluation metrics for summarization show poor correlations with human judgment.
arXiv Detail & Related papers (2022-09-08T03:25:18Z) - Asking and Answering Questions to Evaluate the Factual Consistency of
Summaries [80.65186293015135]
We propose an automatic evaluation protocol called QAGS (pronounced "kags") to identify factual inconsistencies in a generated summary.
QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source.
We believe QAGS is a promising tool in automatically generating usable and factually consistent text.
arXiv Detail & Related papers (2020-04-08T20:01:09Z)
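Returning to the HaRiM$^+$ entry above: the snippet below is a simplified sketch of the token-likelihood ingredient that approach builds on, not the actual HaRiM+ formula. It scores a summary by the average log-probability an off-the-shelf summarizer assigns to its tokens given the source; the model choice (facebook/bart-large-cnn) and the mean-log-probability aggregation are assumptions for illustration.
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Any off-the-shelf summarization model works here; BART-large-CNN is an
# illustrative choice, not necessarily the model used in the HaRiM+ paper.
MODEL_NAME = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def mean_summary_token_logprob(document: str, summary: str) -> float:
    """Average log-probability the summarizer assigns to the summary tokens
    given the document; unusually low values flag tokens the model finds
    unlikely, a rough proxy for hallucination risk."""
    enc = tokenizer(document, return_tensors="pt", truncation=True)
    labels = tokenizer(text_target=summary, return_tensors="pt", truncation=True).input_ids
    logits = model(**enc, labels=labels).logits            # (1, summary_len, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logprobs.mean().item()
```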