VERITE: A Robust Benchmark for Multimodal Misinformation Detection
Accounting for Unimodal Bias
- URL: http://arxiv.org/abs/2304.14133v3
- Date: Wed, 18 Oct 2023 13:19:52 GMT
- Title: VERITE: A Robust Benchmark for Multimodal Misinformation Detection
Accounting for Unimodal Bias
- Authors: Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos,
Panagiotis C. Petrantonakis
- Abstract summary: Multimodal misinformation is a growing problem on social media platforms.
In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks.
We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
- Score: 17.107961913114778
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multimedia content has become ubiquitous on social media platforms, leading
to the rise of multimodal misinformation (MM) and the urgent need for effective
strategies to detect and prevent its spread. In recent years, the challenge of
multimodal misinformation detection (MMD) has garnered significant attention from
researchers and has mainly involved the creation of annotated, weakly
annotated, or synthetically generated training datasets, along with the
development of various deep learning MMD models. However, the problem of
unimodal bias has been overlooked, where specific patterns and biases in MMD
benchmarks can result in biased or unimodal models outperforming their
multimodal counterparts on an inherently multimodal task, making it difficult
to assess progress. In this study, we systematically investigate and identify
the presence of unimodal bias in widely-used MMD benchmarks, namely VMU-Twitter
and COSMOS. To address this issue, we introduce the "VERification of Image-TExt
pairs" (VERITE) benchmark for MMD which incorporates real-world data, excludes
"asymmetric multimodal misinformation" and utilizes "modality balancing". We
conduct an extensive comparative study with a Transformer-based architecture
that shows the ability of VERITE to effectively address unimodal bias,
rendering it a robust evaluation framework for MMD. Furthermore, we introduce a
new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for
generating realistic synthetic training data that preserve crossmodal relations
between legitimate images and false human-written captions. By leveraging
CHASMA in the training process, we observe consistent and notable improvements
in predictive performance on VERITE, with a 9.2% increase in accuracy. We
release our code at: https://github.com/stevejpapad/image-text-verification
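The unimodal-bias issue described in the abstract can be made concrete with a simple probing experiment. The sketch below is not the paper's Transformer-based setup; it assumes precomputed image and text features (e.g., from CLIP) are already available as NumPy arrays and uses linear probes purely for illustration. If an image-only or text-only probe rivals the fused probe on a benchmark, that benchmark rewards unimodal shortcuts.

```python
# Hedged sketch: probing a benchmark for unimodal bias with linear classifiers.
# Feature arrays (one row per image-text pair) and labels are assumed to exist.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def probe_unimodal_bias(img_tr, txt_tr, y_tr, img_te, txt_te, y_te):
    """Train image-only, text-only, and fused probes; compare test accuracies.
    If a unimodal probe matches or beats the multimodal one, the benchmark
    lets biased models score well on an inherently multimodal task."""
    probes = {
        "image-only": (img_tr, img_te),
        "text-only": (txt_tr, txt_te),
        "multimodal": (np.hstack([img_tr, txt_tr]), np.hstack([img_te, txt_te])),
    }
    scores = {}
    for name, (x_tr, x_te) in probes.items():
        clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        scores[name] = accuracy_score(y_te, clf.predict(x_te))
    return scores
```

Under this kind of check, the abstract reports that VMU-Twitter and COSMOS exhibit unimodal bias, whereas VERITE's modality balancing is designed to keep the multimodal-versus-unimodal gap informative.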
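CHASMA itself is described only at a high level in the abstract. The following is a minimal, hypothetical sketch of how CLIP-guided hard misalignment might be generated: each legitimate image is paired with the most crossmodally similar caption drawn from a pool of false, human-written captions, so that the synthetic negative remains plausible rather than randomly mismatched. The function name, the caption pool, and the retrieval strategy are assumptions for illustration; the released code at the repository above is authoritative.

```python
# Hedged, illustrative reconstruction of CHASMA-style hard misalignment
# using CLIP image-text similarity; not the paper's released implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def chasma_like_pairs(image_paths, false_captions, top_k=1):
    """Return (image_path, hard_false_caption) pairs ranked by CLIP similarity."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=false_captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    sims = out.logits_per_image              # shape: (n_images, n_captions)
    top = sims.topk(top_k, dim=-1).indices   # most similar false caption(s)
    return [(image_paths[i], false_captions[j])
            for i in range(len(image_paths)) for j in top[i].tolist()]
```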
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- RADAR: Robust Two-stage Modality-incomplete Industrial Anomaly Detection [61.71770293720491]
We propose a novel two-stage Robust modAlity-incomplete fusing and Detecting frAmewoRk, abbreviated as RADAR.
Our bootstrapping philosophy is to enhance two stages in MIIAD, improving the robustness of the Multimodal Transformer.
Our experimental results demonstrate that the proposed RADAR significantly surpasses conventional MIIAD methods in terms of effectiveness and robustness.
arXiv Detail & Related papers (2024-10-02T16:47:55Z)
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs [47.353720361676004]
Multimodal misinformation detection methods often assume a single source and type of forgery for each sample.
The lack of a benchmark for mixed-source misinformation has hindered progress in this field.
We introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD.
arXiv Detail & Related papers (2024-06-13T03:04:28Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild.
NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
- Mutual Information Regularization for Weakly-supervised RGB-D Salient
Object Detection [33.210575826086654]
We present a weakly-supervised RGB-D salient object detection model.
We focus on effective multimodal representation learning via inter-modal mutual information regularization.
arXiv Detail & Related papers (2023-06-06T12:36:57Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Synthetic Misinformers: Generating and Combating Multimodal
Misinformation [11.696058634552147]
Multimodal misinformation detection (MMD) determines whether the combination of an image and its accompanying text could mislead or misinform.
We show that our proposed CLIP-based Named Entity Swapping can lead to MMD models that surpass other out-of-context (OOC) and named entity inconsistency (NEI) Misinformers in terms of multimodal accuracy.
arXiv Detail & Related papers (2023-03-02T12:59:01Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for
Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity-inducing adversarial loss for learning latent variables and thereby obtain the diversity in output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.