Shortcomings of Question Answering Based Factuality Frameworks for Error Localization
- URL: http://arxiv.org/abs/2210.06748v1
- Date: Thu, 13 Oct 2022 05:23:38 GMT
- Title: Shortcomings of Question Answering Based Factuality Frameworks for Error Localization
- Authors: Ryo Kamoi, Tanya Goyal, Greg Durrett
- Abstract summary: We show that question answering (QA)-based factuality metrics fail to correctly identify error spans in generated summaries.
Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries which are then propagated further into downstream modules.
Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
- Score: 51.01957350348377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent progress in abstractive summarization, models often generate
summaries with factual errors. Numerous approaches to detect these errors have
been proposed, the most popular of which are question answering (QA)-based
factuality metrics. These have been shown to work well at predicting
summary-level factuality and have potential to localize errors within
summaries, but this latter capability has not been systematically evaluated in
past research. In this paper, we conduct the first such analysis and find that,
contrary to our expectations, QA-based frameworks fail to correctly identify
error spans in generated summaries and are outperformed by trivial exact match
baselines. Our analysis reveals a major reason for such poor localization:
questions generated by the QG module often inherit errors from non-factual
summaries which are then propagated further into downstream modules. Moreover,
even human-in-the-loop question generation cannot easily offset these problems.
Our experiments conclusively show that there exist fundamental issues with
localization using the QA framework which cannot be fixed solely by stronger QA
and QG models.
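To make the "trivial exact match baseline" mentioned in the abstract concrete, here is a minimal sketch. The abstract does not define the baseline, so this token-level formulation is an assumption for illustration: it flags every summary token that never appears in the source document. The function names `tokenize` and `exact_match_error_spans` and the example texts are hypothetical and are not taken from the paper.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (deliberately crude)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match_error_spans(source, summary):
    """Return (position, token) pairs for summary tokens absent from the source.

    Assumption: a token that cannot be matched exactly in the source
    is treated as a candidate factual error.
    """
    source_vocab = set(tokenize(source))
    return [
        (i, tok)
        for i, tok in enumerate(tokenize(summary))
        if tok not in source_vocab
    ]


# Hypothetical example: the summary swaps "profit" for "loss".
source = "The company reported a profit of 3 million dollars in 2021."
summary = "The company reported a loss of 3 million dollars in 2021."
flagged = exact_match_error_spans(source, summary)
print(flagged)  # [(4, 'loss')]
```

A sketch like this needs no QG or QA model, which is why it serves as a natural point of comparison for localization: any framework that generates questions from a non-factual summary has to outperform plain string matching to justify its extra modules.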
Related papers
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
We demonstrate a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales.
The breakdown is dramatic, as models show strong fluctuations across even slight problem variations that should not affect problem solving.
We take these initial observations to stimulate urgent re-assessment of the claimed capabilities of the current generation of Large Language Models.
arXiv Detail & Related papers (2024-06-04T07:43:33Z)
- What's under the hood: Investigating Automatic Metrics on Meeting Summarization [7.234196390284036]
Meeting summarization has become a critical task considering the increase in online interactions.
The metrics currently used by default struggle to capture observable errors, showing only weak to moderate correlations.
Only a subset reacts accurately to specific errors, while most correlations show either unresponsiveness or failure to reflect the error's impact on summary quality.
arXiv Detail & Related papers (2024-04-17T07:15:07Z)
- Syn-QA2: Evaluating False Assumptions in Long-tail Questions with Synthetic QA Datasets [7.52684798377727]
We introduce Syn-(QA)$^2$, a set of two synthetically generated question-answering (QA) datasets.
We find that false assumptions in QA are challenging, echoing the findings of prior work.
The detection task is more challenging with long-tail questions compared to naturally occurring questions.
arXiv Detail & Related papers (2024-03-18T18:01:26Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z)
- A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification [0.491574468325115]
We present a large-scale empirical study that, for the first time, enables benchmarking of confidence scoring functions.
The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation.
arXiv Detail & Related papers (2022-11-28T12:25:27Z)
- Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
We propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
RFEC retrieves the evidence sentences from the original document by comparing the sentences with the target summary.
Next, RFEC detects the entity-level errors in the summaries by considering the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences.
arXiv Detail & Related papers (2022-04-18T11:35:02Z)
- Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data.
We formalize the relevant causal structure of problems such as dynamic personalized pricing.
We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z)
- Can Question Generation Debias Question Answering Models? A Case Study on Question-Context Lexical Overlap [25.80004272277982]
Recent neural QG models are biased towards generating questions with high lexical overlap.
We propose a synonym replacement-based approach to augment questions with low lexical overlap.
arXiv Detail & Related papers (2021-09-23T09:53:54Z)
- Loss re-scaling VQA: Revisiting the LanguagePrior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)