Asking and Answering Questions to Evaluate the Factual Consistency of
Summaries
- URL: http://arxiv.org/abs/2004.04228v1
- Date: Wed, 8 Apr 2020 20:01:09 GMT
- Title: Asking and Answering Questions to Evaluate the Factual Consistency of
Summaries
- Authors: Alex Wang, Kyunghyun Cho, and Mike Lewis
- Abstract summary: We propose an automatic evaluation protocol called QAGS (pronounced "kags") to identify factual inconsistencies in a generated summary.
QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source.
We believe QAGS is a promising tool in automatically generating usable and factually consistent text.
- Score: 80.65186293015135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Practical applications of abstractive summarization models are limited by
frequent factual inconsistencies with respect to their input. Existing
automatic evaluation metrics for summarization are largely insensitive to such
errors. We propose an automatic evaluation protocol called QAGS (pronounced
"kags") that is designed to identify factual inconsistencies in a generated
summary. QAGS is based on the intuition that if we ask questions about a
summary and its source, we will receive similar answers if the summary is
factually consistent with the source. To evaluate QAGS, we collect human
judgments of factual consistency on model-generated summaries for the
CNN/DailyMail (Hermann et al., 2015) and XSUM (Narayan et al., 2018)
summarization datasets. QAGS has substantially higher correlations with these
judgments than other automatic evaluation metrics. Also, QAGS offers a natural
form of interpretability: The answers and questions generated while computing
QAGS indicate which tokens of a summary are inconsistent and why. We believe
QAGS is a promising tool in automatically generating usable and factually
consistent text.
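The protocol described above can be sketched as a short pipeline: generate questions about the summary, answer each question once against the summary and once against the source, and average the answer agreement. Below is a minimal Python sketch of that idea, not the authors' released implementation; the question-generation checkpoint, the "generate questions:" prompt format, and the use of SQuAD-style token F1 for answer agreement are all assumptions, while the QA checkpoint is a standard Hugging Face model.

```python
# Minimal sketch of the QAGS idea, not the authors' released implementation.
# Assumptions: QG_MODEL is a placeholder (substitute any seq2seq question-generation
# checkpoint); answer agreement is scored with SQuAD-style token F1.
from collections import Counter

from transformers import pipeline

QG_MODEL = "t5-base"  # placeholder; a QG-fine-tuned checkpoint is needed in practice
QA_MODEL = "distilbert-base-cased-distilled-squad"

question_generator = pipeline("text2text-generation", model=QG_MODEL)
question_answerer = pipeline("question-answering", model=QA_MODEL)


def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answer strings."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qags_score(source: str, summary: str, num_questions: int = 5) -> float:
    """Ask questions about the summary, answer them against both the summary
    and the source, and average the answer agreement (higher = more consistent)."""
    generations = question_generator(
        f"generate questions: {summary}",  # prompt format is an assumption
        num_beams=num_questions,
        num_return_sequences=num_questions,
    )
    questions = [g["generated_text"] for g in generations]

    agreements = []
    for question in questions:
        answer_from_summary = question_answerer(question=question, context=summary)["answer"]
        answer_from_source = question_answerer(question=question, context=source)["answer"]
        agreements.append(token_f1(answer_from_summary, answer_from_source))
    return sum(agreements) / len(agreements) if agreements else 0.0
```

Under these assumptions, `qags_score(article_text, summary_text)` returns a value in [0, 1], and inspecting the per-question answers shows which summary spans disagree with the source, which is the interpretability property the abstract highlights.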
Related papers
- Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation [64.64849950642619]
We develop an evaluation framework inspired by formal semantics for evaluating text-to-image models.
We show that Davidsonian Scene Graph (DSG) produces atomic and unique questions organized in dependency graphs.
We also present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts.
arXiv Detail & Related papers (2023-10-27T16:20:10Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation)
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization [55.60306377044225]
State-of-the-art summarization systems can generate highly fluent summaries.
These summaries, however, may contain factual inconsistencies and/or information not present in the source.
We introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and summary is directly compared.
arXiv Detail & Related papers (2023-01-28T23:08:25Z)
- ASQA: Factoid Questions Meet Long-Form Answers [35.11889930792675]
This work focuses on factoid questions that are ambiguous, that is, have different correct answers depending on interpretation.
Answers to ambiguous questions should synthesize factual information from multiple sources into a long-form summary.
We use this notion of correctness to define an automated metric of performance for ASQA.
arXiv Detail & Related papers (2022-04-12T21:58:44Z)
- AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization [73.91543616777064]
Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions.
One goal of answer summarization is to produce a summary that reflects the range of answer perspectives.
This work introduces a novel dataset of 4,631 CQA threads for answer summarization, curated by professional linguists.
arXiv Detail & Related papers (2021-11-11T21:48:02Z)
- Improving Unsupervised Question Answering via Summarization-Informed Question Generation [47.96911338198302]
Question Generation (QG) is the task of generating a plausible question for a ⟨passage, answer⟩ pair.
We make use of freely available news summary data, transforming declarative sentences into appropriate questions using dependency parsing, named entity recognition and semantic role labeling.
The resulting questions are then combined with the original news articles to train an end-to-end neural QG model.
arXiv Detail & Related papers (2021-09-16T13:08:43Z)
- Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning [7.2745835227138045]
We propose a model for generating question-answer pairs (QA pairs) with self-contained, summary-centric questions and length-constrained, article-summarizing answers.
This dataset is used to learn a QA pair generation model producing summaries as answers that balance brevity with sufficiency jointly with their corresponding questions.
arXiv Detail & Related papers (2021-09-10T06:34:55Z)
- FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization [34.2456005415483]
We tackle the problem of evaluating faithfulness of a generated summary given its source document.
We find that current models exhibit a trade-off between abstractiveness and faithfulness.
We propose an automatic question answering (QA) based metric for faithfulness.
arXiv Detail & Related papers (2020-05-07T21:00:08Z)