Related papers: InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs

InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs

URL: http://arxiv.org/abs/2601.06666v1
Date: Sat, 10 Jan 2026 20:00:17 GMT
Title: InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs
Authors: Yuzhuo Bai, Shuzheng Si, Kangyang Luo, Qingyi Wang, Wenhao Li, Gang Chen, Fanchao Qi, Maosong Sun,
Abstract summary: InFi-Check is a framework for interpretable and fine-grained fact-checking of large language models.<n>InFi-Checker provides supporting evidence, classify fine-grained error types, and produce justifications along with corrections.<n>Experiments show that InFi-Checker achieves state-of-the-art performance on InFi-Check-FG.
Score: 48.98720183285795
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) often hallucinate, yet most existing fact-checking methods treat factuality evaluation as a binary classification problem, offering limited interpretability and failing to capture fine-grained error types. In this paper, we introduce InFi-Check, a framework for interpretable and fine-grained fact-checking of LLM outputs. Specifically, we first propose a controlled data synthesis pipeline that generates high-quality data featuring explicit evidence, fine-grained error type labels, justifications, and corrections. Based on this, we further construct large-scale training data and a manually verified benchmark InFi-Check-FG for fine-grained fact-checking of LLM outputs. Building on these high-quality training data, we further propose InFi-Checker, which can jointly provide supporting evidence, classify fine-grained error types, and produce justifications along with corrections. Experiments show that InFi-Checker achieves state-of-the-art performance on InFi-Check-FG and strong generalization across various downstream tasks, significantly improving the utility and trustworthiness of factuality evaluation.

Related papers

Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems.<n>We present FactArena, a fully automated arena-style evaluation framework.<n>Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
Teaching Your Models to Understand Code via Focal Preference Alignment [70.71693365502212]
In existing approaches, a set of n candidate solutions is evaluated based on test case success rates.<n>Because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships.<n>We propose Target-DPO, a new preference alignment framework that mimics human iterative debug to refine Code LLMs.
arXiv Detail & Related papers (2025-03-04T16:56:34Z)
FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift towards fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification.<n>We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality.<n>Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z)
Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors [11.07539342949602]
We propose an end-to-end framework for detecting factual errors in text summarization. Our framework uses a diverse set of LLM prompts to identify factual inconsistencies. We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination.
arXiv Detail & Related papers (2024-06-18T18:59:37Z)
OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs [59.836774258359945]
OpenFactCheck is a framework for building customized automatic fact-checking systems.<n>It allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims.<n>CheckerEVAL is a solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets.
arXiv Detail & Related papers (2024-05-09T07:15:19Z)
Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models [5.463333911506443]
We aim to enhance the self-checking capabilities of large language models (LLMs) by constructing training data for checking tasks. We propose a specialized checking format called "Step CoT Check" Experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs.
arXiv Detail & Related papers (2024-02-20T14:23:23Z)
Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses. We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document. Preliminary experiments show that FacTool, FactScore and Perplexity are struggling to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.