Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding
- URL: http://arxiv.org/abs/2504.01132v1
- Date: Tue, 01 Apr 2025 19:08:24 GMT
- Title: Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding
- Authors: Melanie Subbiah, Akankshya Mishra, Grace Kim, Liyan Tang, Greg Durrett, Kathleen McKeown
- Abstract summary: Forcing binary labels upon ambiguous claims lowers the reliability of evaluation. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness.
- Score: 50.94206345567363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.
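The abstract describes ARM only at a high level. A minimal sketch of the idea follows, assuming the LLM rewriting step is supplied externally; the `arm_score` function and its token-overlap scoring are illustrative assumptions, not the authors' implementation.

```python
# ARM-style sketch: score a claim by how much an LLM editor had to rewrite it
# to make it unambiguous. The token-level difflib similarity used here is an
# illustrative stand-in for the paper's actual edit measurement.
from difflib import SequenceMatcher

def arm_score(original: str, rewritten: str) -> float:
    """0.0 = no edits needed (claim unambiguous as written);
    values near 1.0 = the claim required heavy rewriting."""
    similarity = SequenceMatcher(None, original.split(), rewritten.split()).ratio()
    return 1.0 - similarity

# Example: an ambiguous narrative claim hedged by an LLM editor.
original = "The top keeps spinning, so the story was all a dream."
rewritten = ("The top is still spinning when the scene ends, leaving it "
             "ambiguous whether the story was a dream.")
print(f"ARM-style rewrite score: {arm_score(original, rewritten):.2f}")
```

Under this reading, an unrewritten claim scores 0, so the metric subsumes the binary judgment while the magnitude of change carries the richer feedback signal the abstract mentions.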
Related papers
- CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs [15.170312674645535]
CRAVE is a Conflicting Reasoning Approach for explainable claim VErification.
It can verify complex claims based on conflicting rationales produced by large language models.
CRAVE achieves much better performance than state-of-the-art methods.
arXiv Detail & Related papers (2025-04-21T07:20:31Z) - Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation [29.44609627447293]
We propose an approach to summary faithfulness evaluation in which multiple agents are assigned initial stances. We introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate that our approach can help identify ambiguities, and it performs even more strongly on non-ambiguous summaries.
arXiv Detail & Related papers (2025-02-12T15:46:50Z) - FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift toward fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification.
We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality.
Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
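As a rough illustration of the fine-grained setup described above, here is a minimal sketch; `decompose` and `verify_subclaim` are hypothetical stand-ins for the benchmark's LLM-based components, not FactLens APIs.

```python
# Sketch of fine-grained verification: split a complex claim into sub-claims
# and verify each one individually. Both callables are hypothetical stand-ins.
from typing import Callable

def fine_grained_verdict(
    claim: str,
    decompose: Callable[[str], list[str]],
    verify_subclaim: Callable[[str], bool],
) -> dict:
    sub_claims = decompose(claim)
    verdicts = {sub: verify_subclaim(sub) for sub in sub_claims}
    # A complex claim is supported only if all of its sub-claims are.
    return {"sub_claims": verdicts, "supported": all(verdicts.values())}
```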
arXiv Detail & Related papers (2024-11-08T21:26:57Z) - Contrastive Learning to Improve Retrieval for Real-world Fact Checking [84.57583869042791]
We present Contrastive Fact-Checking Reranker (CFR), an improved retriever for fact-checking complex claims.
We leverage the AVeriTeC dataset, which annotates subquestions for claims with human-written answers from evidence documents.
We find a 6% improvement in veracity classification accuracy on the dataset.
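A minimal sketch of the kind of contrastive objective such a reranker trains with (InfoNCE-style, in PyTorch) follows; the embeddings, batch construction, and temperature are illustrative assumptions, not CFR's published recipe.

```python
# Sketch of a contrastive retrieval objective: pull the claim embedding toward
# its gold evidence and away from distractor passages.
import torch
import torch.nn.functional as F

def contrastive_loss(claim_emb: torch.Tensor,
                     pos_emb: torch.Tensor,
                     neg_embs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """claim_emb/pos_emb: (d,); neg_embs: (num_negatives, d)."""
    candidates = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)
    sims = F.cosine_similarity(claim_emb.unsqueeze(0), candidates) / temperature
    # The gold evidence occupies index 0 among the candidates.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(sims.unsqueeze(0), target)
```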
arXiv Detail & Related papers (2024-10-07T00:09:50Z) - Defeaters and Eliminative Argumentation in Assurance 2.0 [0.0]
This report describes how defeaters, and multiple levels of defeaters, should be represented and assessed in Assurance 2.0.
A valid concern about this process is that human judgement is fallible and prone to confirmation bias.
arXiv Detail & Related papers (2024-05-16T22:10:01Z) - Longitudinal Counterfactuals: Constraints and Opportunities [59.11233767208572]
We propose using longitudinal data to assess and improve plausibility in counterfactuals.
We develop a metric that compares longitudinal differences to counterfactual differences, allowing us to evaluate how similar a counterfactual is to prior observed changes.
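A minimal sketch of that comparison follows; the nearest-neighbor distance over observed deltas is one plausible instantiation, not necessarily the authors' exact metric.

```python
# Sketch: score a counterfactual by how closely its implied change resembles
# changes actually observed between consecutive visits in longitudinal data.
import numpy as np

def plausibility_distance(x: np.ndarray,
                          x_cf: np.ndarray,
                          observed_deltas: np.ndarray) -> float:
    """x, x_cf: (d,) feature vectors; observed_deltas: (n, d) rows of
    x_{t+1} - x_t collected from longitudinal records."""
    cf_delta = x_cf - x
    distances = np.linalg.norm(observed_deltas - cf_delta, axis=1)
    # Smaller distance to some previously observed change = more plausible.
    return float(distances.min())
```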
arXiv Detail & Related papers (2024-02-29T20:17:08Z) - AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators [38.523194864405326]
AFaCTA is a novel framework that assists in the annotation of factual claims.
AFaCTA calibrates its annotation confidence with consistency along three predefined reasoning paths.
Our analyses also result in PoliClaim, a comprehensive claim detection dataset spanning diverse political topics.
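A minimal sketch of consistency-based confidence calibration in this spirit follows; the reasoning paths are hypothetical stand-ins for the framework's three predefined ones.

```python
# Sketch: annotate the same sentence along several reasoning paths and use
# their agreement as the confidence attached to the majority label.
from collections import Counter
from typing import Callable

def calibrated_label(sentence: str,
                     reasoning_paths: list[Callable[[str], str]]):
    votes = [path(sentence) for path in reasoning_paths]
    label, count = Counter(votes).most_common(1)[0]
    confidence = count / len(votes)  # 1.0 means all paths agree
    return label, confidence
```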
arXiv Detail & Related papers (2024-02-16T20:59:57Z) - Generating Literal and Implied Subquestions to Fact-check Complex Claims [64.81832149826035]
We focus on decomposing a complex claim into a comprehensive set of yes-no subquestions whose answers influence the veracity of the claim.
We present ClaimDecomp, a dataset of decompositions for over 1000 claims.
We show that these subquestions can help identify relevant evidence to fact-check the full claim and derive the veracity through their answers.
arXiv Detail & Related papers (2022-05-14T00:40:57Z) - AmbiFC: Fact-Checking Ambiguous Claims with Evidence [57.7091560922174]
We present AmbiFC, a fact-checking dataset with 10k claims derived from real-world information needs.
We analyze disagreements arising from ambiguity when comparing claims against evidence in AmbiFC.
We develop models that predict veracity while handling this ambiguity via soft labels.
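A minimal sketch of soft-label training in this spirit follows; the model interface and the conversion from annotator votes to a target distribution are illustrative assumptions.

```python
# Sketch: train a veracity classifier against the distribution of annotator
# judgments rather than a single hard label (requires PyTorch >= 1.10, where
# cross_entropy accepts probability targets).
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor,
                    annotator_votes: torch.Tensor) -> torch.Tensor:
    """logits: (batch, n_classes); annotator_votes: (batch, n_classes) counts."""
    soft_targets = annotator_votes / annotator_votes.sum(dim=1, keepdim=True)
    return F.cross_entropy(logits, soft_targets)
```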
arXiv Detail & Related papers (2021-04-01T17:40:08Z) - Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? [58.13152510843004]
With the growing popularity of deep-learning-based NLP models comes a need for interpretable systems.
What is interpretability, and what constitutes a high-quality interpretation?
We call for more clearly differentiating between the different desired criteria an interpretation should satisfy, and we focus on the faithfulness criterion.
arXiv Detail & Related papers (2020-04-07T20:15:28Z)