MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
- URL: http://arxiv.org/abs/2406.11288v3
- Date: Mon, 17 Feb 2025 02:00:52 GMT
- Title: MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
- Authors: Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma,
- Abstract summary: Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks.<n>These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly.<n>However, the content discerned by LVLMs may deviate from factuality due to inherent bias or incorrect inference.<n>We introduce MFC-Bench, a benchmark designed to evaluate the factual accuracy of LVLMs across three stages of verdict prediction.
- Score: 17.052740348747424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks, such as visual question answering and image captioning. These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from factuality due to inherent bias or incorrect inference. To address this issue, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three stages of verdict prediction for MFC: Manipulation, Out-of-Context, and Veracity Classification. Through our evaluation on MFC-Bench, we benchmarked a dozen diverse and representative LVLMs, uncovering that current models still fall short in multimodal fact-checking and demonstrate insensitivity to various forms of manipulated content. We hope that MFC-Bench could raise attention to the trustworthy AI potentially assisted by LVLMs in the future. The MFC-Bench and accompanying resources are publicly accessible at https://github.com/wskbest/MFC-Bench, contributing to ongoing research in the multimodal fact-checking field.
Related papers
- RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking [31.02873474960849]
We introduce RealFactBench, a benchmark designed to assess the fact-checking capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs)<n>RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains.<n>Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty.
arXiv Detail & Related papers (2025-06-14T15:27:44Z) - Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models [19.361686225381447]
Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL)<n>We propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer.
arXiv Detail & Related papers (2025-06-09T16:55:32Z) - MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs [31.793037002996257]
Multimodal Large Language Models (MLLMs) have advanced in integrating diverse modalities but frequently suffer from hallucination.
Existing work primarily focuses on generating citations for text-only content, overlooking the challenges and opportunities of multimodal contexts.
We introduce MCiteBench, the first benchmark designed to evaluate and analyze the multimodal citation text generation ability of MLLMs.
arXiv Detail & Related papers (2025-03-04T13:12:39Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models.<n>The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking.<n>To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT)
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs [37.52094200472755]
This paper reveals a largely under-explored problem from existing video-involved LVLMs - language bias.
We first collect a Video Language Bias Evaluation Benchmark, which is specifically designed to assess the language bias in video-involved LVLMs.
We also propose Multi-branch Contrastive Decoding (MCD), introducing two expert branches to simultaneously counteract language bias.
arXiv Detail & Related papers (2025-02-23T15:04:23Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites [6.796136787585992]
Large Language Models (LLM) are evolving and have significantly revolutionized the landscape of software development.
This paper explores the idea of judging tests used to evaluate compiler implementations of directive-based programming models.
arXiv Detail & Related papers (2024-08-21T15:54:17Z) - Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information.
While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied.
We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z) - FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
FoFo is a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z) - Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark [39.64489055580211]
We introduce a Step-wise Counterfactual benchmark (CofCA), a novel evaluation benchmark consisting of factual data and counterfactual data.
Our experimental results reveal a significant performance gap between Wikipedia-based factual data and counterfactual data, deeming data contamination issues in existing benchmarks.
arXiv Detail & Related papers (2024-02-19T08:12:30Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness
and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained based on large language models (LLM)
While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested.
We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z) - MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large
Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building a LVLM evaluation Hub (LVLM-eHub)
Our LVLM-eHub consists of $8$ representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLM with massive in-domain data such as InstructBLIP heavily overfits many existing tasks, generalizing poorly in the open-world scenario.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.