RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
- URL: http://arxiv.org/abs/2506.12538v1
- Date: Sat, 14 Jun 2025 15:27:44 GMT
- Title: RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
- Authors: Shuo Yang, Yuqin Dai, Guoqing Wang, Xinran Zheng, Jinfeng Xu, Jinze Li, Zhenzhe Ying, Weiqiang Wang, Edith C. H. Ngai,
- Abstract summary: We introduce RealFactBench, a benchmark designed to assess the fact-checking capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs)<n>RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains.<n>Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty.
- Score: 31.02873474960849
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and balance between over-conservatism and over-confidence. Extensive experiments on 7 representative LLMs and 4 MLLMs reveal their limitations in real-world fact-checking and offer valuable insights for further research. RealFactBench is publicly available at https://github.com/kalendsyang/RealFactBench.git.
Related papers
- Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM [58.42678619252968]
Creation-MMBench is a benchmark designed to evaluate the creative capabilities of Multimodal Large Language Models.<n>The benchmark comprises 765 test cases spanning 51 fine-grained tasks.<n> Experimental results reveal that open-source MLLMs significantly underperform compared to proprietary models in creative tasks.
arXiv Detail & Related papers (2025-03-18T17:51:34Z) - Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
We introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts.<n>We employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency.
arXiv Detail & Related papers (2024-12-17T18:12:47Z) - Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z) - MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models [17.052740348747424]
Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks.<n>These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly.<n>However, the content discerned by LVLMs may deviate from factuality due to inherent bias or incorrect inference.<n>We introduce MFC-Bench, a benchmark designed to evaluate the factual accuracy of LVLMs across three stages of verdict prediction.
arXiv Detail & Related papers (2024-06-17T07:51:44Z) - RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks.
Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning.
By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
arXiv Detail & Related papers (2024-06-16T17:26:44Z) - Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information.
While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied.
We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z) - ChEF: A Comprehensive Evaluation Framework for Standardized Assessment
of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and
Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs)
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.