Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content
Dilutions
- URL: http://arxiv.org/abs/2211.02646v1
- Date: Fri, 4 Nov 2022 17:58:02 GMT
- Title: Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content
Dilutions
- Authors: Gaurav Verma, Vishwa Vinay, Ryan A. Rossi, Srijan Kumar
- Abstract summary: We develop a model that generates dilution text that maintains relevance and topical coherence with the image and existing text.
We find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5% on the Crisis Humanitarianism and Sentiment Detection tasks, respectively, in the presence of dilutions generated by our model.
Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations.
- Score: 27.983902791798965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As multimodal learning finds applications in a wide variety of high-stakes
societal tasks, investigating the robustness of these models becomes important. Existing work
has focused on understanding the robustness of vision-and-language models to
imperceptible variations on benchmark tasks. In this work, we investigate the
robustness of multimodal classifiers to cross-modal dilutions - a plausible
variation. We develop a model that, given a multimodal (image + text) input,
generates additional dilution text that (a) maintains relevance and topical
coherence with the image and existing text, and (b) when added to the original
text, leads to misclassification of the multimodal input. Via experiments on
Crisis Humanitarianism and Sentiment Detection tasks, we find that the
performance of task-specific fusion-based multimodal classifiers drops by 23.3%
and 22.5%, respectively, in the presence of dilutions generated by our model.
Metric-based comparisons with several baselines and human evaluations indicate
that our dilutions show higher relevance and topical coherence, while
simultaneously being more effective at demonstrating the brittleness of the
multimodal classifiers. Our work aims to highlight and encourage further
research on the robustness of deep multimodal models to realistic variations,
especially in human-facing societal applications. The code and other resources
are available at https://claws-lab.github.io/multimodal-robustness/.
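The evaluation loop implied by the abstract can be summarized in a short sketch. This is not the authors' released code (see the project page above); `classifier`, `tokenize`, and `generate_dilution` are hypothetical stand-ins for a fusion-based image+text classifier, its text tokenizer, and the dilution-text generator.

```python
# Minimal sketch of the dilution-robustness evaluation described in the abstract.
# `classifier`, `tokenize`, and `generate_dilution` are hypothetical stand-ins,
# not the authors' released implementation.
import torch

@torch.no_grad()
def evaluate_dilution_robustness(classifier, tokenize, generate_dilution,
                                 image_feats, texts, labels):
    """Compare accuracy on clean inputs vs. inputs whose text has been diluted.

    classifier(image_feat, token_ids) -> logits of a fusion-based model
    generate_dilution(image_feat, text) -> additional text that stays relevant
        and topically coherent but is intended to flip the prediction.
    """
    clean_correct, diluted_correct = 0, 0
    for img, text, y in zip(image_feats, texts, labels):
        # Prediction on the original (image, text) pair.
        pred_clean = classifier(img.unsqueeze(0), tokenize(text)).argmax(dim=-1).item()

        # Append the generated dilution text to the original text and re-classify.
        diluted = text + " " + generate_dilution(img, text)
        pred_diluted = classifier(img.unsqueeze(0), tokenize(diluted)).argmax(dim=-1).item()

        clean_correct += int(pred_clean == y)
        diluted_correct += int(pred_diluted == y)

    n = len(labels)
    relative_drop = (clean_correct - diluted_correct) / max(clean_correct, 1)
    return clean_correct / n, diluted_correct / n, relative_drop
```

A drop computed this way is the kind of degradation (roughly 23% on both tasks) that the abstract reports, though the exact metric definition follows the paper.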
Related papers
- Cross-Modal Consistency in Multimodal Large Language Models [33.229271701817616]
We introduce a novel concept termed cross-modal consistency.
Our experimental findings reveal a pronounced inconsistency between the vision and language modalities within GPT-4V.
Our research yields insights into the appropriate utilization of such models and hints at potential avenues for enhancing their design.
arXiv Detail & Related papers (2024-11-14T08:22:42Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based Contrastive Learning for Enhanced Fusion Representation [10.44888349041063]
We introduce a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis.
This framework aims to enhance the discrimination and generalizability of the multimodal representation and to overcome modality bias in the fusion vector.
arXiv Detail & Related papers (2023-12-04T02:58:19Z)
- MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to yield improvements.
arXiv Detail & Related papers (2023-11-16T05:31:21Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Provable Dynamic Fusion for Low-Quality Multimodal Data [94.39538027450948]
Dynamic multimodal fusion emerges as a promising learning paradigm.
Despite its widespread use, theoretical justifications in this field are still notably lacking.
This paper provides theoretical understandings of dynamic multimodal fusion from the generalization perspective, under one of the most popular multimodal fusion frameworks.
A novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which improves performance in terms of both classification accuracy and model robustness.
arXiv Detail & Related papers (2023-06-03T08:32:35Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms.
In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks.
We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Logically at the Factify 2022: Multimodal Fact Verification [2.8914815569249823]
This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022.
Two baseline approaches are proposed and explored, including an ensemble model and a multi-modal attention network.
Our best model ranks first on the leaderboard, obtaining a weighted average F-measure of 0.77 on both the validation and test sets.
arXiv Detail & Related papers (2021-12-16T23:34:07Z)
- Investigating Vulnerability to Adversarial Examples on Multimodal Data Fusion in Deep Learning [32.125310341415755]
We investigated whether current multimodal fusion models exploit the complementary information across modalities to defend against adversarial attacks.
We verified that a multimodal fusion model optimized for better prediction remains vulnerable to adversarial attack even if only one of the sensors is attacked (a minimal illustration follows this list).
arXiv Detail & Related papers (2020-05-22T03:45:06Z)
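The last entry above considers attacks on a single modality of a fusion model. As a hedged illustration of that setting (not the specific attack used in the cited paper), the sketch below applies a standard one-step FGSM perturbation to the image branch of a generic two-input fusion classifier; `fusion_model` and its signature are assumptions.

```python
# Illustrative single-modality FGSM attack on a generic image+text fusion model.
# `fusion_model(image, text_feats) -> logits` is an assumed interface; the image
# is assumed to lie in [0, 1].
import torch
import torch.nn.functional as F

def fgsm_on_image_only(fusion_model, image, text_feats, label, eps=0.03):
    """Perturb only the image input; the text modality is left untouched."""
    image = image.clone().detach().requires_grad_(True)
    logits = fusion_model(image, text_feats)
    loss = F.cross_entropy(logits, label)
    loss.backward()
    # One-step sign ascent on the image modality only.
    adv_image = (image + eps * image.grad.sign()).clamp(0.0, 1.0).detach()
    return adv_image
```

Comparing `fusion_model(adv_image, text_feats)` against the clean prediction shows whether the untouched text modality compensates for the perturbed image, which is the question that entry investigates.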