Related papers: Similarity over Factuality: Are we making progress on multimodal out-of-context misinformation detection?

URL: http://arxiv.org/abs/2407.13488v1
Date: Thu, 18 Jul 2024 13:08:55 GMT
Title: Similarity over Factuality: Are we making progress on multimodal out-of-context misinformation detection?
Authors: Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Abstract summary: Out-of-context (OOC) misinformation poses a significant challenge in multimodal fact-checking. Recent research in evidence-based OOC detection has seen a trend towards increasingly complex architectures. We introduce a simple yet robust baseline, which assesses similarity between image-text pairs and external image and text evidence.
Score: 15.66049149213069
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Out-of-context (OOC) misinformation poses a significant challenge in multimodal fact-checking, where images are paired with texts that misrepresent their original context to support false narratives. Recent research in evidence-based OOC detection has seen a trend towards increasingly complex architectures, incorporating Transformers, foundation models, and large language models. In this study, we introduce a simple yet robust baseline, which assesses MUltimodal SimilaritiEs (MUSE), specifically the similarity between image-text pairs and external image and text evidence. Our results demonstrate that MUSE, when used with conventional classifiers like Decision Tree, Random Forest, and Multilayer Perceptron, can compete with and even surpass the state-of-the-art on the NewsCLIPpings and VERITE datasets. Furthermore, integrating MUSE in our proposed "Attentive Intermediate Transformer Representations" (AITR) significantly improved performance, by 3.3% and 7.5% on NewsCLIPpings and VERITE, respectively. Nevertheless, the success of MUSE, relying on surface-level patterns and shortcuts, without examining factuality and logical inconsistencies, raises critical questions about how we define the task, construct datasets, collect external evidence and overall, how we assess progress in the field. We release our code at: https://github.com/stevejpapad/outcontext-misinfo-progress
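
The similarity features described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes that the image, the text, and each retrieved evidence item have already been embedded in a shared vector space by some encoder, which the abstract does not specify. The function names, the feature names, and the choice to keep the maximum similarity over evidence items are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: Vector, evidence: List[Vector]) -> float:
    """Highest similarity between the query and any evidence item.

    Taking the maximum is an assumption of this sketch; returns 0.0
    when no evidence was retrieved.
    """
    return max((cosine_similarity(query, e) for e in evidence), default=0.0)


def similarity_features(
    image: Vector,
    text: Vector,
    image_evidence: List[Vector],
    text_evidence: List[Vector],
) -> Dict[str, float]:
    """Hypothetical feature set: the pair's own similarity plus its
    similarity to external image and text evidence."""
    return {
        "image_text": cosine_similarity(image, text),
        "image_image_evidence": best_match(image, image_evidence),
        "text_text_evidence": best_match(text, text_evidence),
        "image_text_evidence": best_match(image, text_evidence),
        "text_image_evidence": best_match(text, image_evidence),
    }


# Toy two-dimensional embeddings, invented purely for illustration.
features = similarity_features(
    image=[1.0, 0.0],
    text=[1.0, 0.0],
    image_evidence=[[0.0, 1.0], [1.0, 0.0]],
    text_evidence=[[0.0, 1.0]],
)
print(features)
```

A feature dictionary of this kind, flattened into a fixed-order vector, is the sort of input one could pass to a conventional classifier such as a decision tree or random forest, as the abstract describes.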

Related papers

Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations [2.139909491081949]
We propose a unified framework for fine-grained multimodal fact verification called "MultiCheck". Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline.
arXiv Detail & Related papers (2025-08-07T07:36:53Z)
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z)
Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict [3.9495397165148023]
We show how vision-language models (VLMs) reason when confronted with scenarios that present conflicting image and text cues. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information.
arXiv Detail & Related papers (2025-04-11T20:56:52Z)
CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z)
Vision-Language Models Struggle to Align Entities across Modalities [13.100184125419695]
Cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans.
arXiv Detail & Related papers (2025-03-05T19:36:43Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
A Reality Check on Context Utilisation for Retrieval-Augmented Generation [44.54803681476863]
We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results.
arXiv Detail & Related papers (2024-12-22T14:16:38Z)
Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention [45.31956918333587]
In multimodal sentiment analysis, collecting text data is often more challenging than video or audio. We have developed a robust model that integrates multimodal sentiment information, even in the absence of text modality.
arXiv Detail & Related papers (2024-10-19T07:59:41Z)
Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We introduce MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild. NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities. We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap. We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation. We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)
VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms. In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks. We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)
A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video. Recent studies have found that current benchmark datasets may have obvious moment annotation biases. We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources [70.68526820807402]
A real image is re-purposed to support other narratives by misrepresenting its context and/or elements. Our goal is an inspectable method that automates this time-consuming and reasoning-intensive process by fact-checking the image-context pairing. Our work offers the first step and benchmark for open-domain, content-based, multi-modal fact-checking.
arXiv Detail & Related papers (2021-11-30T19:36:20Z)
FiLMing Multimodal Sarcasm Detection with Attention [0.7340017786387767]
Sarcasm detection identifies natural language expressions whose intended meaning is different from what is implied by its surface meaning. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes. Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal detection dataset.
arXiv Detail & Related papers (2021-08-09T06:33:29Z)
The Surprising Performance of Simple Baselines for Misinformation Detection [4.060731229044571]
We examine the performance of a broad set of modern transformer-based language models. We present our framework as a baseline for creating and evaluating new methods for misinformation detection.
arXiv Detail & Related papers (2021-04-14T16:25:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.