All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations
- URL: http://arxiv.org/abs/2510.07083v1
- Date: Wed, 08 Oct 2025 14:40:33 GMT
- Title: All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations
- Authors: Miriam Wanner, Leif Azzopardi, Paul Thomas, Soham Dan, Benjamin Van Durme, Nick Craswell
- Abstract summary: Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect, as it receives the same weight as peripheral details. We introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query.
- Score: 57.8036236269546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect, as it receives the same weight as peripheral details, raising the question: how can we reliably detect such differences when there are errors in key information? Current approaches that measure factuality tend to be insensitive to omitted or false key information. To investigate this lack of sensitivity, we construct VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information. Using this dataset, we demonstrate the insensitivity of existing evaluation metrics to key information errors. To address this gap, we introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query. Our analysis demonstrates that VITAL metrics more reliably detect errors in key information than previous methods. Our dataset, metrics, and analysis provide a foundation for more accurate and robust assessment of LLM factuality.
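The abstract does not spell out how the VITAL metrics are computed, so the sketch below only illustrates the general idea of importance-sensitive scoring, not the paper's actual formulation. The `Claim` structure, the importance weights, and both function names are hypothetical, introduced here for exposition.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    importance: float  # assumed weight in [0, 1]; higher = more vital to the query
    supported: bool    # whether a verifier judged the claim factually supported

def weighted_precision(response_claims: list[Claim]) -> float:
    """Share of the response's importance mass that is factually supported.

    Unlike unweighted precision, a false vital claim costs far more
    than a false peripheral detail.
    """
    total = sum(c.importance for c in response_claims)
    if total == 0.0:
        return 0.0
    return sum(c.importance for c in response_claims if c.supported) / total

def weighted_recall(key_claims: list[Claim], covered: set[str]) -> float:
    """Share of the reference key-information mass the response covers.

    `key_claims` are importance-weighted claims a good answer should state;
    `covered` holds the texts of those claims the response actually expresses.
    """
    total = sum(c.importance for c in key_claims)
    if total == 0.0:
        return 0.0
    return sum(c.importance for c in key_claims if c.text in covered) / total
```

Under any scheme of this shape, omitting or falsifying one high-importance claim moves the score sharply, while an error in a low-importance aside barely registers; that is the sensitivity the VITALERRORS perturbations are designed to probe.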
Related papers
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems. We present FactArena, a fully automated arena-style evaluation framework. Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z)
- Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models [2.0861090421004937]
Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy.
arXiv Detail & Related papers (2025-08-05T19:20:05Z)
- Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
- VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts [13.694698772833627]
We introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction. We also introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information.
arXiv Detail & Related papers (2025-05-14T18:02:37Z)
- Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation [16.506990103937515]
We stress-test a range of automatic factuality metrics to probe what they actually capture. All of the metrics show substantial performance drops under these stress tests. Some metrics are more sensitive to benign, fact-preserving edits than to factual corrections.
arXiv Detail & Related papers (2024-11-25T18:15:15Z)
- FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift towards fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification. We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality. Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z)
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from a tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
- FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs remain far from satisfactory at faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
- Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics [74.28810048824519]
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not.
We benchmark the lexical answer verification methods that have been used by current QA-based metrics, as well as two more sophisticated text comparison methods (standard lexical verification is illustrated in the sketch after this list).
arXiv Detail & Related papers (2022-04-21T15:43:45Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
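The answer-verification entry above compares lexical methods for deciding whether a QA model's predicted answer matches a gold answer. For reference, here is a minimal sketch of the two standard lexical checks such QA-based metrics typically rely on, normalized exact match and token-level F1 in the SQuAD style; the normalization choices and function names are illustrative, not that paper's exact procedure.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True if the two answers are identical after normalization."""
    return normalize(prediction) == normalize(gold)

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    n_common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(pred_tokens)
    recall = n_common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, exact_match("The Eiffel Tower", "Eiffel Tower") is True once the article is stripped, while token_f1 gives partial credit when the prediction adds or drops tokens, which is precisely where lexical verification starts to diverge from human judgments of correctness.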
This list is automatically generated from the titles and abstracts of the papers on this site.