InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs
- URL: http://arxiv.org/abs/2601.06666v1
- Date: Sat, 10 Jan 2026 20:00:17 GMT
- Title: InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs
- Authors: Yuzhuo Bai, Shuzheng Si, Kangyang Luo, Qingyi Wang, Wenhao Li, Gang Chen, Fanchao Qi, Maosong Sun,
- Abstract summary: InFi-Check is a framework for interpretable and fine-grained fact-checking of large language models.<n>InFi-Checker provides supporting evidence, classify fine-grained error types, and produce justifications along with corrections.<n>Experiments show that InFi-Checker achieves state-of-the-art performance on InFi-Check-FG.
- Score: 48.98720183285795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) often hallucinate, yet most existing fact-checking methods treat factuality evaluation as a binary classification problem, offering limited interpretability and failing to capture fine-grained error types. In this paper, we introduce InFi-Check, a framework for interpretable and fine-grained fact-checking of LLM outputs. Specifically, we first propose a controlled data synthesis pipeline that generates high-quality data featuring explicit evidence, fine-grained error type labels, justifications, and corrections. Based on this, we further construct large-scale training data and a manually verified benchmark InFi-Check-FG for fine-grained fact-checking of LLM outputs. Building on these high-quality training data, we further propose InFi-Checker, which can jointly provide supporting evidence, classify fine-grained error types, and produce justifications along with corrections. Experiments show that InFi-Checker achieves state-of-the-art performance on InFi-Check-FG and strong generalization across various downstream tasks, significantly improving the utility and trustworthiness of factuality evaluation.
Related papers
- Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems.<n>We present FactArena, a fully automated arena-style evaluation framework.<n>Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Teaching Your Models to Understand Code via Focal Preference Alignment [70.71693365502212]
In existing approaches, a set of n candidate solutions is evaluated based on test case success rates.<n>Because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships.<n>We propose Target-DPO, a new preference alignment framework that mimics human iterative debug to refine Code LLMs.
arXiv Detail & Related papers (2025-03-04T16:56:34Z) - FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift towards fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification.<n>We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality.<n>Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z) - Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors [11.07539342949602]
We propose an end-to-end framework for detecting factual errors in text summarization.
Our framework uses a diverse set of LLM prompts to identify factual inconsistencies.
We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination.
arXiv Detail & Related papers (2024-06-18T18:59:37Z) - OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs [59.836774258359945]
OpenFactCheck is a framework for building customized automatic fact-checking systems.<n>It allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims.<n>CheckerEVAL is a solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets.
arXiv Detail & Related papers (2024-05-09T07:15:19Z) - Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models [5.463333911506443]
We aim to enhance the self-checking capabilities of large language models (LLMs) by constructing training data for checking tasks.
We propose a specialized checking format called "Step CoT Check"
Experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs.
arXiv Detail & Related papers (2024-02-20T14:23:23Z) - Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore and Perplexity are struggling to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.