VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts
- URL: http://arxiv.org/abs/2505.09701v1
- Date: Wed, 14 May 2025 18:02:37 GMT
- Title: VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts
- Authors: Xin Liu, Lechen Zhang, Sheza Munir, Yiyang Gu, Lu Wang
- Abstract summary: We introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction. We also introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information.
- Score: 6.810019560977178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and closed-weight LLMs on FactRBench indicates that larger models within the same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.
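As a rough illustration of the precision/recall framing described in the abstract (not the paper's actual pipeline), the sketch below scores a response's extracted facts for precision and a reference fact set for recall; the `is_supported` and `is_covered` judgments are placeholder callables standing in for a model-based verifier.
```python
# Minimal sketch of precision/recall-style factuality scoring, assuming facts have
# already been extracted from the response and a reference fact set is available
# (as in a FactRBench-style setup). The support/coverage judgments are placeholders;
# a real verifier would use retrieval plus an LLM judge.
from typing import Callable, List


def factual_precision(extracted_facts: List[str],
                      is_supported: Callable[[str], bool]) -> float:
    """Fraction of facts extracted from the response that are judged supported."""
    if not extracted_facts:
        return 0.0
    return sum(is_supported(f) for f in extracted_facts) / len(extracted_facts)


def factual_recall(reference_facts: List[str],
                   is_covered: Callable[[str], bool]) -> float:
    """Fraction of reference facts that the response is judged to cover."""
    if not reference_facts:
        return 0.0
    return sum(is_covered(f) for f in reference_facts) / len(reference_facts)


if __name__ == "__main__":
    response_facts = ["Paris is the capital of France.",
                      "Paris hosted the 2024 Summer Olympics."]
    reference_facts = ["Paris is the capital of France.",
                       "Paris lies on the Seine."]
    # Toy matchers: exact string membership stands in for model-based verification.
    p = factual_precision(response_facts, lambda f: f in reference_facts)
    r = factual_recall(reference_facts, lambda f: f in response_facts)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50
```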
Related papers
- Learning to Reason for Factuality [48.08503522255537]
We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance.
Our model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
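Purely as one illustrative reading of "simultaneously considers" (the cited paper's actual reward is not specified here), such a reward could combine the three signals as a weighted sum; the weights and the saturating detail term below are assumptions.
```python
# Hypothetical weighted combination of factual precision, detail, and relevance.
# The weights, the saturation cap, and the component scorers are illustrative
# assumptions, not the reward proposed in the cited paper.
def combined_reward(precision: float, num_verifiable_claims: int, relevance: float,
                    w_prec: float = 1.0, w_detail: float = 0.3, w_rel: float = 0.5,
                    detail_cap: int = 20) -> float:
    detail = min(num_verifiable_claims, detail_cap) / detail_cap  # saturating detail term
    return w_prec * precision + w_detail * detail + w_rel * relevance


print(combined_reward(precision=0.9, num_verifiable_claims=12, relevance=0.8))  # 1.48
```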
arXiv Detail & Related papers (2025-08-07T17:57:09Z)
- RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models [9.211266032947497]
We demonstrate that fact retrieval is substantially more difficult than isolated point-wise queries.
Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy.
These findings underscore limitations in current LLMs' ability to synthesize structured factual knowledge.
arXiv Detail & Related papers (2025-05-27T16:33:38Z)
- R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.
Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.
We propose Reasoning-Driven Process Reward Modeling (R-PRM).
R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
- FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models [59.171510592986735]
We propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response.
Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches.
arXiv Detail & Related papers (2025-02-25T19:01:48Z)
- FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced fact-checking research.
Existing automated fact-checking evaluation methods rely on static datasets and classification metrics.
We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z)
- FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift toward fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification.
We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality.
Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
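A bare-bones decompose-then-verify loop in the spirit of the fine-grained setup above; the sub-claim splitter and the verifier are placeholders, and FactLens's own sub-claim quality metrics are not reproduced.
```python
# Toy decompose-then-verify loop: a complex claim is split into sub-claims and each
# is checked separately. The splitter and verifier below stand in for model-based
# components; this is not the FactLens implementation.
from typing import Callable, Dict, List


def verify_fine_grained(claim: str,
                        decompose: Callable[[str], List[str]],
                        verify: Callable[[str], bool]) -> Dict[str, bool]:
    """Return a per-sub-claim verdict for one complex claim."""
    return {sub_claim: verify(sub_claim) for sub_claim in decompose(claim)}


if __name__ == "__main__":
    claim = "Marie Curie won Nobel Prizes in both Physics and Chemistry."
    manual_split = lambda _: ["Marie Curie won a Nobel Prize in Physics.",
                              "Marie Curie won a Nobel Prize in Chemistry."]
    print(verify_fine_grained(claim, manual_split, lambda s: True))  # toy verifier accepts all
```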
arXiv Detail & Related papers (2024-11-08T21:26:57Z)
- FactGenius: Combining Zero-Shot Prompting and Fuzzy Relation Mining to Improve Fact Verification with Knowledge Graphs [0.0]
We present FactGenius, a novel method that enhances fact-checking by combining zero-shot prompting of large language models with fuzzy text matching on knowledge graphs.
The evaluation of FactGenius on the FactKG, a benchmark dataset for fact verification, demonstrates that it significantly outperforms existing baselines.
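To make the "fuzzy text matching on knowledge graphs" idea concrete, here is a small sketch that matches an LLM-proposed relation against knowledge-graph triples by string similarity; the toy triples and the 0.8 threshold are assumptions, not FactGenius's actual pipeline.
```python
# Sketch of fuzzy relation matching against knowledge-graph triples. The toy KG,
# the similarity threshold, and the matching rule are illustrative assumptions;
# FactGenius's actual prompting-plus-filtering pipeline is not reproduced here.
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

TOY_KG: List[Triple] = [
    ("Berlin", "capital_of", "Germany"),
    ("Berlin", "located_in", "Europe"),
]


def fuzzy_match_relation(subject: str, candidate_relation: str, obj: str,
                         kg: List[Triple], threshold: float = 0.8) -> Optional[Triple]:
    """Return the KG triple whose relation label best matches the candidate, if any."""
    best, best_score = None, threshold
    for s, rel, o in kg:
        if s != subject or o != obj:
            continue
        score = SequenceMatcher(None, candidate_relation.lower(), rel.lower()).ratio()
        if score >= best_score:
            best, best_score = (s, rel, o), score
    return best


print(fuzzy_match_relation("Berlin", "capitalOf", "Germany", TOY_KG))
# -> ('Berlin', 'capital_of', 'Germany')
```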
arXiv Detail & Related papers (2024-06-03T13:24:37Z)
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further elaborate the robustness metric: a model is judged robust only if its performance is consistently accurate across entire cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- FactKB: Generalizable Factuality Evaluation using Language Models Enhanced with Factual Knowledge [37.2179237007464]
We propose FactKB, a simple new approach to factuality evaluation that is generalizable across domains.
We introduce three types of complementary factuality pretraining objectives based on direct entity facts, facts grounded in auxiliary knowledge about entities, and facts constructed compositionally through knowledge base walks.
The resulting factuality evaluation model achieves state-of-the-art performance on two in-domain news summarization benchmarks and on three out-of-domain scientific literature datasets.
arXiv Detail & Related papers (2023-05-14T23:58:05Z)
- Enhancing Factual Consistency of Abstractive Summarization [57.67609672082137]
We propose FASum, a fact-aware summarization model that extracts factual relations and integrates them into the summary generation process.
We then design FC, a factual corrector model that automatically corrects factual errors in summaries generated by existing systems.
arXiv Detail & Related papers (2020-03-19T07:36:10Z)