Shortcomings of Question Answering Based Factuality Frameworks for Error Localization
- URL: http://arxiv.org/abs/2210.06748v1
- Date: Thu, 13 Oct 2022 05:23:38 GMT
- Title: Shortcomings of Question Answering Based Factuality Frameworks for Error Localization
- Authors: Ryo Kamoi, Tanya Goyal, Greg Durrett
- Abstract summary: We show that question answering (QA)-based factuality metrics fail to correctly identify error spans in generated summaries.
Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries which are then propagated further into downstream modules.
Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
- Score: 51.01957350348377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent progress in abstractive summarization, models often generate
summaries with factual errors. Numerous approaches to detect these errors have
been proposed, the most popular of which are question answering (QA)-based
factuality metrics. These have been shown to work well at predicting
summary-level factuality and have potential to localize errors within
summaries, but this latter capability has not been systematically evaluated in
past research. In this paper, we conduct the first such analysis and find that,
contrary to our expectations, QA-based frameworks fail to correctly identify
error spans in generated summaries and are outperformed by trivial exact match
baselines. Our analysis reveals a major reason for such poor localization:
questions generated by the QG module often inherit errors from non-factual
summaries which are then propagated further into downstream modules. Moreover,
even human-in-the-loop question generation cannot easily offset these problems.
Our experiments conclusively show that there exist fundamental issues with
localization using the QA framework which cannot be fixed solely by stronger QA
and QG models.
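The pipeline the paper critiques can be illustrated with a toy sketch. The span extractor, QG, and QA components below are trivial string-matching stand-ins for the trained neural models used in real QA-based metrics (e.g. QuestEval or QAFactEval); the names and heuristics are illustrative only. The point is to show structurally where error propagation enters: the question is conditioned on the possibly non-factual summary.

```python
# Toy sketch of a QA-based factuality pipeline. All three modules are
# string-matching stand-ins for the trained models used in practice.

def extract_candidate_spans(summary):
    """Stand-in answer-span extractor: capitalized tokens only."""
    return [tok for tok in summary.split() if tok[0].isupper()]

def generate_question(summary, span):
    """Stand-in QG module: mask the span in the summary.

    This is where localization goes wrong: the question is conditioned on
    the (possibly non-factual) summary, so any other error in the summary
    is copied into the question and propagated downstream.
    """
    return summary.replace(span, "[MASK]")

def answer_from_source(question, source):
    """Stand-in QA module: read off the source token aligned with the mask."""
    q_toks, s_toks = question.split(), source.split()
    if "[MASK]" in q_toks:
        i = q_toks.index("[MASK]")
        if i < len(s_toks):
            return s_toks[i]
    return None

def localize_errors(summary, source):
    """Label each candidate span factual (True) or erroneous (False)."""
    return {
        span: answer_from_source(generate_question(summary, span), source) == span
        for span in extract_candidate_spans(summary)
    }

source = "Alice founded the company in Paris."
summary = "Bob founded the company in Paris."  # "Bob" is the factual error
print(localize_errors(summary, source))  # → {'Bob': False, 'Paris.': True}
```

Note that the question used to verify "Paris." still contains the erroneous "Bob"; the toy aligner above ignores it, but a real QA model reading that question against the source can be misled, which is exactly the propagation failure the abstract describes.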
Related papers
- Is Q-learning an Ill-posed Problem? [2.4424095531386234]
This paper investigates the instability of Q-learning in continuous environments.
We show that even in relatively simple benchmarks, the fundamental task of Q-learning can be inherently ill-posed and prone to failure.
arXiv Detail & Related papers (2025-02-20T08:42:30Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.
Models may behave unreliably due to poorly explored failure modes.
Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency.
Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function.
We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths.
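For reference, plain self-consistency combined with a crude probability-based pruning step can be sketched as below. The fixed threshold and function name are illustrative assumptions, not RPC's actual formulation.

```python
from collections import Counter

def pruned_self_consistency(answers, path_probs, threshold=0.1):
    """Majority vote over sampled answers after discarding low-probability
    reasoning paths. The fixed threshold is a toy stand-in for the
    pruning criterion the paper proposes."""
    kept = [a for a, p in zip(answers, path_probs) if p >= threshold]
    if not kept:  # fall back to all samples if everything was pruned
        kept = answers
    return Counter(kept).most_common(1)[0][0]

# Three sampled reasoning paths: two agree on "42", one low-probability outlier.
print(pruned_self_consistency(["42", "42", "7"], [0.50, 0.35, 0.05]))  # → 42
```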
arXiv Detail & Related papers (2025-02-01T18:09:49Z)
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
We demonstrate a dramatic breakdown of the function and reasoning capabilities of state-of-the-art models trained at the largest available scales.
Models show strong fluctuations across even slight problem variations that should not affect problem solving.
We take these initial observations to stimulate an urgent re-assessment of the claimed capabilities of the current generation of Large Language Models.
arXiv Detail & Related papers (2024-06-04T07:43:33Z)
- What's under the hood: Investigating Automatic Metrics on Meeting Summarization [7.234196390284036]
Meeting summarization has become a critical task considering the increase in online interactions.
Currently used default metrics struggle to capture observable errors, showing only weak to moderate correlations.
Only a subset of metrics reacts accurately to specific errors; most are either unresponsive or fail to reflect an error's impact on summary quality.
arXiv Detail & Related papers (2024-04-17T07:15:07Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification [0.491574468325115]
We present the first large-scale empirical study enabling the benchmarking of confidence scoring functions.
The finding that a simple softmax response baseline is the overall best-performing method underlines the drastic shortcomings of current evaluation practices.
arXiv Detail & Related papers (2022-11-28T12:25:27Z) - Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data.
We formalize the relevant causal structure of problems such as dynamic personalized pricing.
We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z) - Can Question Generation Debias Question Answering Models? A Case Study
on Question-Context Lexical Overlap [25.80004272277982]
Recent neural QG models are biased towards generating questions with high lexical overlap.
We propose a synonym replacement-based approach to augment questions with low lexical overlap.
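A minimal version of such augmentation might look like the following. The hand-written synonym table and the overlap heuristic are hypothetical stand-ins (a real system would draw on a lexical resource such as WordNet); the sketch only shows how replacing context-overlapping words lowers lexical overlap.

```python
# Toy synonym-replacement augmentation: lower the lexical overlap between a
# question and its context by swapping overlapping words for synonyms.
# The synonym table is a hand-written stand-in for a real lexical resource.

SYNONYMS = {
    "purchased": "bought",
    "automobile": "car",
    "large": "big",
}

def lexical_overlap(question, context):
    """Fraction of distinct question tokens that also appear in the context."""
    q = set(question.lower().split())
    return len(q & set(context.lower().split())) / len(q)

def augment_question(question, context):
    """Replace question tokens found in the context with synonyms, if any."""
    ctx = set(context.lower().split())
    out = []
    for tok in question.split():
        low = tok.lower()
        out.append(SYNONYMS[low] if low in ctx and low in SYNONYMS else tok)
    return " ".join(out)

context = "The man purchased a large automobile yesterday"
question = "What large automobile was purchased ?"
print(lexical_overlap(question, context))                             # → 0.5
print(augment_question(question, context))            # → What big car was bought ?
print(lexical_overlap(augment_question(question, context), context))  # → 0.0
```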
arXiv Detail & Related papers (2021-09-23T09:53:54Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.