CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization
- URL: http://arxiv.org/abs/2510.11096v1
- Date: Mon, 13 Oct 2025 07:44:54 GMT
- Title: CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization
- Authors: Fengling Zhu, Boshi Liu, Jingyu Hua, Sheng Zhong,
- Abstract summary: Multimodal Large Language Models (MLLMs) have achieved remarkable success in tasks such as image captioning, visual question answering, and cross-modal reasoning.<n>Their multimodal nature exposes them to adversarial threats, where attackers can perturb either modality or both jointly to induce harmful, misleading, or policy violating outputs.<n>Existing defense strategies, such as adversarial training and input purification, face notable limitations.<n>We propose a supervised diffusion based denoising framework that leverages paired adversarial clean image datasets to fine-tune diffusion models.
- Score: 4.6467356929461925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in tasks such as image captioning, visual question answering, and cross-modal reasoning by integrating visual and textual modalities. However, their multimodal nature also exposes them to adversarial threats, where attackers can perturb either modality or both jointly to induce harmful, misleading, or policy violating outputs. Existing defense strategies, such as adversarial training and input purification, face notable limitations: adversarial training typically improves robustness only against known attacks while incurring high computational costs, whereas conventional purification approaches often suffer from degraded image quality and insufficient generalization to complex multimodal tasks. In this work, we focus on defending the visual modality, which frequently serves as the primary entry point for adversarial manipulation. We propose a supervised diffusion based denoising framework that leverages paired adversarial clean image datasets to fine-tune diffusion models with directional, task specific guidance. Unlike prior unsupervised purification methods such as DiffPure, our approach achieves higher quality reconstructions while significantly improving defense robustness in multimodal tasks. Furthermore, we incorporate prompt optimization as a complementary defense mechanism, enhancing resistance against diverse and unseen attack strategies. Extensive experiments on image captioning and visual question answering demonstrate that our method not only substantially improves robustness but also exhibits strong transferability to unknown adversarial attacks. These results highlight the effectiveness of supervised diffusion based denoising for multimodal defense, paving the way for more reliable and secure deployment of MLLMs in real world applications.
Related papers
- Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction [67.45032003041399]
We propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbations.<n>SADCA establishes a contrastive learning mechanism involving adversarial, positive and negative samples, to reinforce the semantic inconsistency of the obtained perturbations.<n>Experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods.
arXiv Detail & Related papers (2026-03-05T05:46:16Z) - Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning [28.15997901023315]
Recall is a novel adversarial framework designed to compromise the robustness of unlearned IGMs.<n>It consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original prompt.<n>These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions.
arXiv Detail & Related papers (2025-07-09T02:59:01Z) - Gradient-Free Adversarial Purification with Diffusion Models [26.591092007972325]
Adversarial training and adversarial purification are widely used to enhance model robustness against adversarial attacks.<n>In this paper, we propose an effective and efficient defense framework that counters both perturbation-based and unrestricted adversarial attacks.
arXiv Detail & Related papers (2025-01-23T02:34:14Z) - Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models [32.23201683108716]
We propose a novel strategy that exclusively employs image patches for attacks, thus preserving the integrity of the original text.
Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations.
Comprehensive experiments conducted in a white-box setting for image-to-text scenarios reveal that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate.
arXiv Detail & Related papers (2024-10-07T10:06:01Z) - Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach [30.9778838504609]
Vision-language pretraining with transformers has demonstrated exceptional performance across numerous multimodal tasks.
Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities.
We propose a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities.
arXiv Detail & Related papers (2024-08-24T04:31:37Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective [42.04728834962863]
Pretrained vision-language models (VLMs) like CLIP exhibit exceptional generalization across diverse downstream tasks.
Recent studies reveal their vulnerability to adversarial attacks, with defenses against text-based and multimodal attacks remaining largely unexplored.
This work presents the first comprehensive study on improving the adversarial robustness of VLMs against attacks targeting image, text, and multimodal inputs.
arXiv Detail & Related papers (2024-04-30T06:34:21Z) - Meta Invariance Defense Towards Generalizable Robustness to Unknown Adversarial Attacks [62.036798488144306]
Current defense mainly focuses on the known attacks, but the adversarial robustness to the unknown attacks is seriously overlooked.
We propose an attack-agnostic defense method named Meta Invariance Defense (MID)
We show that MID simultaneously achieves robustness to the imperceptible adversarial perturbations in high-level image classification and attack-suppression in low-level robust image regeneration.
arXiv Detail & Related papers (2024-04-04T10:10:38Z) - Multi-granular Adversarial Attacks against Black-box Neural Ranking Models [111.58315434849047]
We create high-quality adversarial examples by incorporating multi-granular perturbations.
We transform the multi-granular attack into a sequential decision-making process.
Our attack method surpasses prevailing baselines in both attack effectiveness and imperceptibility.
arXiv Detail & Related papers (2024-04-02T02:08:29Z) - Few-Shot Adversarial Prompt Learning on Vision-Language Models [62.50622628004134]
The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention.
Previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision.
We propose a few-shot adversarial prompt framework where adapting input sequences with limited data makes significant adversarial robustness improvement.
arXiv Detail & Related papers (2024-03-21T18:28:43Z) - Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme.
Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z) - Stylized Adversarial Defense [105.88250594033053]
adversarial training creates perturbation patterns and includes them in the training set to robustify the model.
We propose to exploit additional information from the feature space to craft stronger adversaries.
Our adversarial training approach demonstrates strong robustness compared to state-of-the-art defenses.
arXiv Detail & Related papers (2020-07-29T08:38:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.