VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias
- URL: http://arxiv.org/abs/2304.14133v3
- Date: Wed, 18 Oct 2023 13:19:52 GMT
- Title: VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias
- Authors: Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
- Abstract summary: Multimodal misinformation is a growing problem on social media platforms.
In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks.
We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
- Score: 17.107961913114778
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multimedia content has become ubiquitous on social media platforms, leading
to the rise of multimodal misinformation (MM) and the urgent need for effective
strategies to detect and prevent its spread. In recent years, the challenge of
multimodal misinformation detection (MMD) has garnered significant attention from
researchers and has mainly involved the creation of annotated, weakly
annotated, or synthetically generated training datasets, along with the
development of various deep learning MMD models. However, the problem of
unimodal bias has been overlooked, where specific patterns and biases in MMD
benchmarks can result in biased or unimodal models outperforming their
multimodal counterparts on an inherently multimodal task, making it difficult
to assess progress. In this study, we systematically investigate and identify
the presence of unimodal bias in widely-used MMD benchmarks, namely VMU-Twitter
and COSMOS. To address this issue, we introduce the "VERification of Image-TExt
pairs" (VERITE) benchmark for MMD which incorporates real-world data, excludes
"asymmetric multimodal misinformation" and utilizes "modality balancing". We
conduct an extensive comparative study with a Transformer-based architecture
that shows the ability of VERITE to effectively address unimodal bias,
rendering it a robust evaluation framework for MMD. Furthermore, we introduce a
new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for
generating realistic synthetic training data that preserve crossmodal relations
between legitimate images and false human-written captions. By leveraging
CHASMA in the training process, we observe consistent and notable improvements
in predictive performance on VERITE, with a 9.2% increase in accuracy. We
release our code at: https://github.com/stevejpapad/image-text-verification
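Although the abstract does not detail the mining step, the gist of CHASMA-style hard synthetic misalignment can be sketched: pair each legitimate image with the false human-written caption that a crossmodal encoder finds most plausible, yielding "miscaptioned" pairs that are hard for a detector to dismiss. A minimal sketch, assuming CLIP as the crossmodal scorer (the model choice, function name, caption source, and similarity threshold are illustrative, not the authors' released implementation):

# Sketch of CHASMA-style hard synthetic misalignment (illustrative
# assumptions, not the authors' code): pair each legitimate image with
# the false caption a crossmodal encoder scores as most plausible.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mine_hard_misalignments(images, false_captions, sim_threshold=0.25):
    """images: PIL images whose truthful captions live elsewhere.
    false_captions: human-written false claims (e.g. from fact-checks).
    Returns (image_index, caption_index) pairs scoring above the assumed
    cutoff, i.e. misalignments that still "look right" crossmodally."""
    inputs = processor(text=false_captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between every image and every false caption.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = img @ txt.T                       # (n_images, n_captions)
    best = sims.argmax(dim=1)                # hardest caption per image
    return [(i, j.item()) for i, j in enumerate(best)
            if sims[i, j] > sim_threshold]

Training an MMD model on such pairs alongside truthful ones is what the abstract credits for the 9.2% accuracy gain on VERITE.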
Related papers
- MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs [47.353720361676004]
Multimodal misinformation detection methods often assume a single source and type of forgery for each sample.
The lack of a benchmark for mixed-source misinformation has hindered progress in this field.
We introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD.
arXiv Detail & Related papers (2024-06-13T03:04:28Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We take a different perspective on the problem and advance multimodal DFER performance by adapting SSL-pretrained disjoint unimodal encoders.
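The adaptation recipe is only named here, but a common way to realize it is to freeze the SSL-pretrained unimodal encoders and train lightweight adapters plus a fusion head on top. A minimal sketch under that assumption (class names, bottleneck size, and the late-fusion choice are illustrative, not MMA-DFER's architecture):

# Sketch: frozen SSL unimodal encoders + trainable adapters and fusion
# head (illustrative assumptions, not the MMA-DFER implementation).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class LateFusionDFER(nn.Module):
    """Assumes each encoder maps its input to a (batch, dim) feature."""
    def __init__(self, audio_encoder, video_encoder, dim, n_classes=7):
        super().__init__()
        for enc in (audio_encoder, video_encoder):
            for p in enc.parameters():
                p.requires_grad = False   # encoders stay frozen
        self.audio_encoder, self.video_encoder = audio_encoder, video_encoder
        self.audio_adapter, self.video_adapter = Adapter(dim), Adapter(dim)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, audio, video):
        a = self.audio_adapter(self.audio_encoder(audio))
        v = self.video_adapter(self.video_encoder(video))
        return self.head(torch.cat([a, v], dim=-1))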
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose NativE, a comprehensive framework for multi-modal knowledge graph completion (MMKGC) in the wild.
NativE introduces a relation-guided dual adaptive fusion module that enables adaptive fusion across arbitrary modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
- Mutual Information Regularization for Weakly-supervised RGB-D Salient Object Detection [33.210575826086654]
We present a weakly-supervised RGB-D salient object detection model.
We focus on effective multimodal representation learning via inter-modal mutual information regularization.
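The summary does not state how the mutual information term is estimated; a standard choice is an InfoNCE-style contrastive bound between RGB and depth features of the same scene, sketched below as an assumed stand-in for the paper's exact regularizer:

# Illustrative InfoNCE-style estimator of inter-modal mutual information
# between RGB and depth features (an assumption, not the paper's exact
# objective). Matching pairs are positives; other in-batch pairs are
# negatives, giving a lower bound on I(rgb; depth) to regularize with.
import torch
import torch.nn.functional as F

def infonce_mi(rgb_feats, depth_feats, temperature=0.07):
    """rgb_feats, depth_feats: (batch, dim) features of the same scenes."""
    rgb = F.normalize(rgb_feats, dim=-1)
    depth = F.normalize(depth_feats, dim=-1)
    logits = rgb @ depth.T / temperature          # (batch, batch)
    targets = torch.arange(rgb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)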
arXiv Detail & Related papers (2023-06-06T12:36:57Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
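Dynamic modality gating is only named here; the sketch below shows the generic idea of learning per-sample modality weights so an incongruent modality can be down-weighted before fusion (a simplified stand-in, not HCT-DMG's hierarchical Transformer gating):

# Sketch of per-sample dynamic modality gating (illustrative only).
import torch
import torch.nn as nn

class DynamicModalityGate(nn.Module):
    """Weights each modality per sample before fusion, so a modality
    that is incongruent with the others can be down-weighted instead
    of corrupting the fused representation."""
    def __init__(self, dim, n_modalities=3):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats):                 # feats: list of (batch, dim)
        stacked = torch.stack(feats, dim=1)   # (batch, n_mod, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)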
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Synthetic Misinformers: Generating and Combating Multimodal Misinformation [11.696058634552147]
Multimodal misinformation detection (MMD) determines whether the combination of an image and its accompanying text could mislead or misinform.
We show that our proposed CLIP-based Named Entity Swapping can lead to MMD models that surpass those trained on other out-of-context (OOC) and named entity inconsistency (NEI) Misinformers in terms of multimodal accuracy.
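As a rough illustration of the recipe (the entity pool, spaCy pipeline, and plausibility scoring below are assumptions, not the paper's code): swap a named entity in a truthful caption for a same-type entity, then keep swaps that CLIP still scores as plausible for the image.

# Illustrative CLIP-based named entity swapping (the general recipe,
# not the paper's implementation).
import random
import spacy
import torch
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def swap_entity(caption, entity_pool):
    """Replace the first named entity with a random same-type entity
    from `entity_pool`, e.g. {"PERSON": [...], "GPE": [...]}."""
    doc = nlp(caption)
    for ent in doc.ents:
        candidates = [e for e in entity_pool.get(ent.label_, [])
                      if e != ent.text]
        if candidates:
            return caption.replace(ent.text, random.choice(candidates))
    return None  # no swappable entity found

def clip_plausibility(image, caption):
    """Image-text match score; high-scoring swaps make hard negatives."""
    inputs = processor(text=[caption], images=[image], return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits_per_image.item()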
arXiv Detail & Related papers (2023-03-02T12:59:01Z)
- Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions [27.983902791798965]
We develop a model that generates dilution text while maintaining relevance and topical coherence with the image and existing text.
We find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5% on the two evaluated tasks in the presence of dilutions generated by our model.
Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations.
arXiv Detail & Related papers (2022-11-04T17:58:02Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
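For orientation, CIB instantiates the standard information-bottleneck tradeoff; the correlation-based bound is the paper's contribution, so the generic objective below is only the common starting point:

% Generic information-bottleneck objective that CIB builds on:
% keep representations z predictive of answers y while compressing
% away input redundancy, with beta trading the two terms off.
\max_{\theta} \; I(z; y) - \beta \, I(x; z)

Here x are the multimodal inputs, z the learned representations, and y the answers; the paper's tight upper bound concerns the I(x; z) compression term.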
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity-inducing adversarial loss for learning latent variables and thereby obtain the diversity in output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)