FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs
- URL: http://arxiv.org/abs/2510.12839v2
- Date: Wed, 05 Nov 2025 03:36:23 GMT
- Title: FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs
- Authors: Yingjia Wan, Haochen Tan, Xiao Zhu, Xinyu Zhou, Zhiwei Li, Qingsong Lv, Changxuan Sun, Jiaqi Zeng, Yi Xu, Jianqiao Lu, Yinhong Liu, Zhijiang Guo,
- Abstract summary: FaStfact is an evaluation framework that achieves the highest alignment with human evaluation and time/token efficiency. FaStfact first employs chunk-level claim extraction integrated with confidence-based pre-verification. For searching and verification, it collects document-level evidence from crawled web pages and selectively retrieves it during verification.
- Score: 34.87719459551127
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to efficiency bottlenecks and reliability concerns. Prior efforts attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency due to overcomplicated pipeline components, and (2) ineffectiveness stemming from inaccurate claim sets and insufficient evidence. To address these limitations, we propose FaStfact, an evaluation framework that achieves the highest alignment with human evaluation and time/token efficiency among existing baselines. FaStfact first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the time and token cost while ensuring reliability. For searching and verification, it collects document-level evidence from crawled web pages and selectively retrieves it during verification. Extensive experiments based on an annotated benchmark, FaStfact-Bench, demonstrate the reliability of FaStfact in both efficiently and effectively evaluating long-form factuality. Code, benchmark data, and the annotation interface tool are available at https://github.com/Yingjia-Wan/FaStfact.
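The pipeline described in the abstract can be sketched in a few lines. This is a minimal, runnable illustration, not the authors' implementation: the claim extractor, web search, and verifier are replaced by stand-ins, and the confidence threshold is an assumed parameter; only the control flow (chunk-level extraction, confidence-based pre-verification, selective verification) follows the abstract.

```python
def chunk(sentences, size=3):
    """Group sentences into chunks so claims are extracted per chunk,
    cutting the number of extraction calls versus per-sentence prompting."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def evaluate(claims, threshold=0.9, verify=lambda claim: True):
    """Return the fraction of claims judged supported. Claims the extractor
    is already confident about skip the search-and-verify step entirely."""
    pre_verified = [c for c in claims if c["confidence"] >= threshold]
    to_check = [c for c in claims if c["confidence"] < threshold]
    supported = len(pre_verified) + sum(verify(c) for c in to_check)
    return supported / max(len(claims), 1)

claims = [
    {"text": "Paris is the capital of France", "confidence": 0.97},
    {"text": "The Moon is made of cheese", "confidence": 0.20},
]
# Stand-in verifier: pretend the web search found no supporting evidence.
print(evaluate(claims, verify=lambda c: False))  # 0.5
```

The efficiency claim in the abstract corresponds to the two shortcuts here: fewer extraction calls via chunking, and skipped search for high-confidence claims.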
Related papers
- MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment [8.649665560258702]
We propose a memory-enhanced veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent-driven search, structured knowledge representations, and a persistent memory module within a Reason-Action style iterative process. We evaluate MERMAID on three fact-checking benchmarks and two claim-verification datasets using multiple LLMs.
arXiv Detail & Related papers (2026-01-29T22:12:33Z) - User-Centric Evidence Ranking for Attribution and Fact Verification [12.059713949823335]
Evidence Ranking is a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes user reading effort while still making all available evidence accessible for sequential verification. This work provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.
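One simple way to realize such a ranking is a greedy set-cover heuristic: at each step, pick the evidence snippet that covers the most still-unsupported facts, so a reader can stop as soon as the claim is fully backed. The sketch below is an illustrative assumption, not the paper's method; bag-of-words overlap stands in for a real coverage measure.

```python
def rank_evidence(claim_facts, evidence):
    """Greedily order evidence so uncovered facts are supported earliest."""
    remaining, pool, ranked = set(claim_facts), list(evidence), []
    while pool:
        # Pick the snippet covering the most facts not yet seen.
        best = max(pool, key=lambda e: len(remaining & set(e.split())))
        ranked.append(best)
        pool.remove(best)
        remaining -= set(best.split())
    return ranked

ranking = rank_evidence(["alice", "ceo"], ["alice smiled", "alice is ceo", "rain"])
print(ranking[0])  # 'alice is ceo' ranks first: it covers both facts at once
```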
arXiv Detail & Related papers (2026-01-29T08:23:26Z) - DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval [36.38599923075882]
DIVER is a retrieval pipeline designed for reasoning-intensive information retrieval. It consists of four stages: document preprocessing, query expansion, retrieval, and reranking. On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 45.8 overall and 28.9 on original queries, consistently outperforming competitive reasoning-aware models.
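The nDCG@10 figures quoted above are the standard ranking metric: the discounted cumulative gain of the returned top-10, normalized by that of the ideal ordering. A compact reference implementation:

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for a ranked list of graded relevance scores (log2 discount)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0]))  # 1.0: the list is already in ideal order
```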
arXiv Detail & Related papers (2025-08-11T13:57:49Z) - Long-Form Information Alignment Evaluation Beyond Atomic Facts [60.25969380388974]
We introduce MontageLie, a benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency.
arXiv Detail & Related papers (2025-05-21T17:46:38Z) - FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift towards fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification. We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality. Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z) - FIRE: Fact-checking with Iterative Retrieval and Verification [63.67320352038525]
FIRE is a novel framework that integrates evidence retrieval and claim verification in an iterative manner. It achieves slightly better performance while reducing large language model (LLM) costs by an average of 7.6 times and search costs by 16.5 times. These results indicate that FIRE holds promise for application in large-scale fact-checking operations.
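The cost savings in such iterative schemes come from stopping retrieval as soon as the verifier can reach a verdict. A hedged sketch of that loop, with toy stand-ins for the search engine and LLM verifier (not FIRE's actual components):

```python
def iterative_verify(claim, search, verdict, max_rounds=5):
    """Alternate retrieval and verification, stopping at the first verdict.
    Early exits are where the LLM-call and search-call savings come from."""
    evidence = []
    for round_no in range(max_rounds):
        evidence.extend(search(claim, round_no))
        decision = verdict(claim, evidence)
        if decision is not None:
            return decision, round_no + 1  # verdict reached, stop searching
    return "not enough info", max_rounds

# Stand-ins: the second search round surfaces the decisive document.
search = lambda claim, r: ["decisive-doc"] if r == 1 else ["noise"]
verdict = lambda claim, ev: "supported" if "decisive-doc" in ev else None
print(iterative_verify("some claim", search, verdict))  # ('supported', 2)
```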
arXiv Detail & Related papers (2024-10-17T06:44:18Z) - On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z) - FactGenius: Combining Zero-Shot Prompting and Fuzzy Relation Mining to Improve Fact Verification with Knowledge Graphs [0.0]
We present FactGenius, a novel method that enhances fact-checking by combining zero-shot prompting of large language models with fuzzy text matching on knowledge graphs.
The evaluation of FactGenius on FactKG, a benchmark dataset for fact verification, demonstrates that it significantly outperforms existing baselines.
arXiv Detail & Related papers (2024-06-03T13:24:37Z) - FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z) - Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
We propose RFEC, an efficient factual error correction system based on entity retrieval and post-editing.
RFEC retrieves evidence sentences from the original document by comparing them with the target summary.
Next, RFEC detects entity-level errors in the summary by considering the evidence sentences and substitutes the wrong entities with accurate entities from the evidence sentences.
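That substitution step can be illustrated with a deliberately naive sketch. RFEC uses learned components; here "supported" simply means the entity string appears in the evidence sentence, which is an assumption for illustration only.

```python
def correct_entities(summary, evidence, candidate_entities):
    """Replace summary entities unsupported by the evidence sentence with
    an entity the evidence actually mentions (naive string matching)."""
    supported = [e for e in candidate_entities if e in evidence]
    wrong = [e for e in candidate_entities if e in summary and e not in evidence]
    for entity in wrong:
        if supported:
            summary = summary.replace(entity, supported[0])
    return summary

evidence = "Bob was appointed CEO of Acme in 2021."
print(correct_entities("Alice is the CEO of Acme.", evidence, ["Alice", "Bob"]))
# Bob is the CEO of Acme.
```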
arXiv Detail & Related papers (2022-04-18T11:35:02Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.