Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations
- URL: http://arxiv.org/abs/2508.05097v1
- Date: Thu, 07 Aug 2025 07:36:53 GMT
- Title: Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations
- Authors: Aditya Kishore, Gaurav Kumar, Jasabanta Patro
- Abstract summary: We propose a unified framework for fine-grained multimodal fact verification called "MultiCheck". Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline.
- Score: 2.139909491081949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we have proposed a unified framework for fine-grained multimodal fact verification called "MultiCheck", designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.
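To make the described architecture concrete, the sketch below shows one way the pieces could fit together: pretrained text and image encoders produce feature vectors, a fusion module combines them via element-wise interactions, a classification head predicts the verdict, and an InfoNCE-style contrastive loss pulls matched claim-evidence pairs together in a shared latent space. All module names, dimensions, and loss details are illustrative assumptions; this is not the authors' released code, and the actual MultiCheck implementation may differ.

```python
# Minimal sketch of a MultiCheck-style fusion and training objective
# (hypothetical names and dimensions; for illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionFactChecker(nn.Module):
    def __init__(self, text_dim=768, image_dim=768, hidden_dim=512, num_classes=5):
        super().__init__()
        # Project both modalities into a shared latent space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Element-wise interactions (product, absolute difference) are
        # concatenated with the projected features and fed to the classifier.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 4, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),  # e.g. the five Factify 2 categories
        )

    def forward(self, text_feat, image_feat):
        t = self.text_proj(text_feat)    # (B, hidden_dim)
        v = self.image_proj(image_feat)  # (B, hidden_dim)
        fused = torch.cat([t, v, t * v, torch.abs(t - v)], dim=-1)
        return self.classifier(fused), t, v

def contrastive_loss(claim_emb, evidence_emb, temperature=0.07):
    """InfoNCE-style loss pulling matched claim-evidence pairs together,
    using the other pairs in the batch as negatives."""
    claim_emb = F.normalize(claim_emb, dim=-1)
    evidence_emb = F.normalize(evidence_emb, dim=-1)
    logits = claim_emb @ evidence_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(claim_emb.size(0), device=claim_emb.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random features standing in for encoder outputs
# (e.g. a BERT-like text encoder and a ViT-like image encoder).
model = FusionFactChecker()
text_feat, image_feat = torch.randn(8, 768), torch.randn(8, 768)
logits, t, v = model(text_feat, image_feat)
labels = torch.randint(0, 5, (8,))
loss = F.cross_entropy(logits, labels) + contrastive_loss(t, v)
```

Concatenating the element-wise product and absolute difference with the projected features is a common way to expose cross-modal agreement and disagreement to the classifier; the paper's fusion module may use different interaction terms or a different contrastive formulation.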
Related papers
- METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content.
Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z)
- Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning? [3.966028515034415]
This work systematically investigates whether images generated on-the-fly by Text-to-Image (T2I) models can serve as a valuable complementary modality for text-centric tasks.
arXiv Detail & Related papers (2025-06-21T07:32:09Z)
- Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning [122.81815833343026]
We introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding.
Our approach consists of two stages: First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements.
On ChartQA, our approach improves accuracy from 70.88% (language-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT.
arXiv Detail & Related papers (2025-05-26T08:54:14Z)
- WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge [73.76722241704488]
We propose a plug-in framework named WisdoM to leverage the contextual world knowledge induced from the large vision-language models (LVLMs) for enhanced multimodal sentiment analysis.
We show that our approach has substantial improvements over several state-of-the-art methods.
arXiv Detail & Related papers (2024-01-12T16:08:07Z)
- Disentangling Multi-view Representations Beyond Inductive Bias [32.15900989696017]
We propose a novel multi-view representation disentangling method that ensures both interpretability and generalizability of the resulting representations.
Our experiments on four multi-view datasets demonstrate that our proposed method outperforms 12 comparison methods in terms of clustering and classification performance.
arXiv Detail & Related papers (2023-08-03T09:09:28Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims to resolve ambiguous mentions against a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
- A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding data sets and experimental results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
- End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models [0.0]
We propose end-to-end multimodal fact-checking and explanation generation.
The goal is to assess the truthfulness of a claim by retrieving relevant evidence and predicting a truthfulness label.
To support this research, we construct Mocheg, a large-scale dataset consisting of 15,601 claims.
arXiv Detail & Related papers (2022-05-25T04:36:46Z)
- Logically at the Factify 2022: Multimodal Fact Verification [2.8914815569249823]
This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022.
Two baseline approaches are proposed and explored, including an ensemble model and a multi-modal attention network.
Our best model ranked first on the leaderboard, obtaining a weighted average F-measure of 0.77 on both the validation and test sets.
arXiv Detail & Related papers (2021-12-16T23:34:07Z)
- Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources [70.68526820807402]
A real image is re-purposed to support other narratives by misrepresenting its context and/or elements.
Our goal is an inspectable method that automates this time-consuming and reasoning-intensive process by fact-checking the image-context pairing.
Our work offers the first step and benchmark for open-domain, content-based, multi-modal fact-checking.
arXiv Detail & Related papers (2021-11-30T19:36:20Z)
- Universal Weighting Metric Learning for Cross-Modal Matching [79.32133554506122]
Cross-modal matching has been a highlighted research topic in both vision and language areas.
We propose a simple and interpretable universal weighting framework for cross-modal matching.
arXiv Detail & Related papers (2020-10-07T13:16:45Z)