Vulnerability-Aware Robust Multimodal Adversarial Training
- URL: http://arxiv.org/abs/2511.18138v1
- Date: Sat, 22 Nov 2025 17:49:45 GMT
- Authors: Junrui Zhang, Xinyu Zhao, Jie Peng, Chenjie Wang, Jianmin Ji, Tianlong Chen
- Abstract summary: Multimodal learning has shown significant superiority on various tasks by integrating multiple modalities. Existing methods mainly focus on attacks on specific modalities or indiscriminately attack all modalities. We introduce a probe-in-training adversarial training method that improves multimodal robustness by identifying the vulnerability of each modality.
- Score: 45.350855453965615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning has shown significant superiority on various tasks by integrating multiple modalities. However, the interdependencies among modalities increase the susceptibility of multimodal models to adversarial attacks. Existing methods mainly focus on attacks on specific modalities or indiscriminately attack all modalities. In this paper, we find that these approaches ignore the differences between modalities in their contribution to final robustness, resulting in suboptimal robustness performance. To bridge this gap, we introduce Vulnerability-Aware Robust Multimodal Adversarial Training (VARMAT), a probe-in-training adversarial training method that improves multimodal robustness by identifying the vulnerability of each modality. To be specific, VARMAT first explicitly quantifies the vulnerability of each modality, grounded in a first-order approximation of the attack objective (Probe). Then, we propose a targeted regularization term that penalizes modalities with high vulnerability, guiding robust learning while maintaining task accuracy (Training). We demonstrate the enhanced robustness of our method across multiple multimodal datasets involving diverse modalities. Finally, we achieve {12.73%, 22.21%, 11.19%} robustness improvement on three multimodal datasets, revealing a significant blind spot in multimodal adversarial training.
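The abstract describes two steps: a probe that scores each modality's vulnerability from a first-order approximation of the attack objective, and a regularization term that penalizes highly vulnerable modalities. The sketch below illustrates that general idea only; it is not the authors' VARMAT implementation. It assumes a toy linear late-fusion classifier with binary cross-entropy loss and an L-infinity attack budget per modality, and every name in it (`input_gradients`, `vulnerability_scores`, `vulnerability_penalty`, the `eps` budgets, and the penalty form) is hypothetical. The score uses the standard identity that the largest first-order loss increase under a perturbation with infinity norm at most eps is eps times the L1 norm of the input gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradients(weights, inputs, label):
    """Gradient of binary cross-entropy w.r.t. each modality's input.

    Toy model (an assumption, not from the paper):
    logit = sum over modalities of <w_m, x_m>, so dL/dx_m = (sigmoid(logit) - y) * w_m.
    """
    logit = sum(
        w * x
        for name in weights
        for w, x in zip(weights[name], inputs[name])
    )
    residual = sigmoid(logit) - label
    return {name: [residual * w for w in weights[name]] for name in weights}


def vulnerability_scores(grads, eps):
    """First-order probe: max over ||d||_inf <= eps of <g, d> equals eps * ||g||_1."""
    return {name: eps[name] * sum(abs(g) for g in grads[name]) for name in grads}


def vulnerability_penalty(scores):
    """Hypothetical regularizer: weight each score by its share of the total,
    so modalities with high vulnerability dominate the penalty."""
    total = sum(scores.values())
    if total == 0.0:
        return 0.0
    return sum(s * s for s in scores.values()) / total


# Two modalities; the image branch has much larger weights than the audio branch.
weights = {"image": [2.0, -1.0], "audio": [0.1, 0.2]}
inputs = {"image": [0.5, 0.5], "audio": [1.0, 1.0]}
eps = {"image": 0.1, "audio": 0.1}  # assumed per-modality attack budgets

grads = input_gradients(weights, inputs, label=1.0)
scores = vulnerability_scores(grads, eps)
penalty = vulnerability_penalty(scores)
most_vulnerable = max(scores, key=scores.get)
print(most_vulnerable, scores, penalty)
```

In a training loop, a term like `penalty` would be added to the task loss with a tunable coefficient. How VARMAT actually defines and weights its regularizer is not stated in the abstract.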
Related papers
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Adversarial Robustness for Unified Multi-Modal Encoders via Efficient Calibration [12.763688592842717]
We present the first comprehensive study of adversarial vulnerability in unified multi-modal encoders. Non-visual inputs, such as audio and point clouds, are especially fragile. Our method improves adversarial robustness by up to 47.3 percent at epsilon = 4/255.
arXiv Detail & Related papers (2025-05-17T08:26:04Z)
- Continual Multimodal Contrastive Learning [99.53621521696051]
Multimodal Contrastive Learning (MCL) has advanced the alignment of different modalities and the generation of multimodal representations in a joint space. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. In this paper, we formulate continual MCL (CMCL) through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with the previously learned knowledge.
arXiv Detail & Related papers (2025-03-19T07:57:08Z)
- Asymmetric Reinforcing against Multi-modal Representation Bias [59.685072206359855]
We propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). Our ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. We have significantly improved the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.
arXiv Detail & Related papers (2025-01-02T13:00:06Z)
- AI Safety in Practice: Enhancing Adversarial Robustness in Multimodal Image Captioning [0.0]
Multimodal machine learning models that combine visual and textual data are increasingly being deployed in critical applications.
This paper presents an effective strategy to enhance the robustness of multimodal image captioning models against adversarial attacks.
arXiv Detail & Related papers (2024-07-30T20:28:31Z)
- Confidence-aware multi-modality learning for eye disease screening [58.861421804458395]
We propose a novel multi-modality evidential fusion pipeline for eye disease screening.
It provides a measure of confidence for each modality and elegantly integrates the multi-modality information.
Experimental results on both public and internal datasets demonstrate that our model excels in robustness.
arXiv Detail & Related papers (2024-05-28T13:27:30Z)
- MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance [10.580712937465032]
We identify the previously ignored gradient conflict between multimodal and unimodal learning objectives.
We propose the MMPareto algorithm, which can ensure a final gradient whose direction is common to all learning objectives.
Our method is also expected to facilitate multi-task cases with a clear discrepancy in task difficulty.
arXiv Detail & Related papers (2024-05-28T01:19:13Z)
- Quantifying and Enhancing Multi-modal Robustness with Modality Preference [9.367733452960492]
Multi-modal models are vulnerable to pervasive perturbations, such as uni-modal attacks and missing conditions.
Larger uni-modal representation margins and more reliable integration for modalities are essential components for achieving higher robustness.
Inspired by our theoretical finding, we introduce a training procedure called Certifiable Robust Multi-modal Training.
arXiv Detail & Related papers (2024-02-09T08:33:48Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Understanding and Measuring Robustness of Multimodal Learning [14.257147031953211]
We introduce a comprehensive measurement of the adversarial robustness of multimodal learning via a framework called MUROAN.
We first present a unified view of multimodal models in MUROAN and identify the fusion mechanism of multimodal models as a key vulnerability.
We then introduce a new type of multimodal adversarial attack in MUROAN, called the decoupling attack, that aims to compromise multimodal models.
arXiv Detail & Related papers (2021-12-22T21:10:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.