Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack
- URL: http://arxiv.org/abs/2502.19672v1
- Date: Thu, 27 Feb 2025 01:33:19 GMT
- Title: Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack
- Authors: Chenhe Gu, Jindong Gu, Andong Hua, Yao Qin
- Abstract summary: We introduce the Dynamic Vision-Language Alignment (DynVLA) Attack, a novel approach that injects dynamic perturbations into the vision-language connector to enhance generalization across the diverse vision-language alignments of different models. Our experimental results show that DynVLA significantly improves the transferability of adversarial examples across various MLLMs, including BLIP2, InstructBLIP, MiniGPT4, LLaVA, and closed-source models such as Gemini.
- Score: 16.70399451598529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs), built upon LLMs, have recently gained attention for their capabilities in image recognition and understanding. However, while MLLMs are vulnerable to adversarial attacks, the transferability of these attacks across different models remains limited, especially under the targeted attack setting. Existing methods primarily focus on vision-specific perturbations but struggle with the complex nature of vision-language modality alignment. In this work, we introduce the Dynamic Vision-Language Alignment (DynVLA) Attack, a novel approach that injects dynamic perturbations into the vision-language connector to enhance generalization across the diverse vision-language alignments of different models. Our experimental results show that DynVLA significantly improves the transferability of adversarial examples across various MLLMs, including BLIP2, InstructBLIP, MiniGPT4, LLaVA, and closed-source models such as Gemini.
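The abstract's core idea, attacking the vision-language connector rather than the vision encoder alone, can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not the authors' released implementation: the `vision_encoder` and `connector` callables, the PGD hyperparameters, and the per-step random Gaussian blur standing in for the paper's dynamic perturbation are all assumptions.

```python
# Hypothetical sketch of a PGD-style targeted attack on a surrogate MLLM's
# vision-language connector (e.g., a Q-Former-like module after a ViT).
# The random per-step Gaussian blur is an illustrative stand-in for
# "dynamic" perturbation, meant to avoid overfitting one fixed alignment.
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def connector_transfer_attack(vision_encoder, connector, image, target_image,
                              eps=8 / 255, alpha=1 / 255, steps=200):
    """Push the connector features of `image` toward those of
    `target_image` under an L_inf budget of `eps`."""
    with torch.no_grad():
        target_feat = connector(vision_encoder(target_image))
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Fresh randomness each step so the perturbation generalizes.
        sigma = float(torch.empty(1).uniform_(0.5, 1.5))
        adv = gaussian_blur(image + delta, kernel_size=[5, 5],
                            sigma=[sigma, sigma])
        loss = F.mse_loss(connector(vision_encoder(adv)), target_feat)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # targeted: descend the loss
            delta.clamp_(-eps, eps)              # stay within the budget
            delta.copy_((image + delta).clamp(0, 1) - image)  # valid pixels
            delta.grad = None
    return (image + delta).detach()
```

A transfer evaluation in the spirit of the paper would then feed the resulting image to held-out MLLMs (e.g., InstructBLIP or LLaVA) and check whether the attacker-chosen response carries over.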
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimization. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z)
- Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models [19.899086203883254]
Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks. Their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely on the vision encoder, rather than the entire LVLM. We present the first systematic study of encoder-based adversarial transferability in LVLMs.
arXiv Detail & Related papers (2026-02-10T05:51:02Z)
- When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. We develop the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z)
- Universal Camouflage Attack on Vision-Language Models for Autonomous Driving [67.34987318443761]
Visual language modeling for automated driving is emerging as a promising research direction. VLM-AD remains vulnerable to serious security threats from adversarial attacks. We propose the first Universal Camouflage Attack framework for VLM-AD.
arXiv Detail & Related papers (2025-09-24T14:52:01Z)
- MLLMs are Deeply Affected by Modality Bias [158.64371871084478]
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as text and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities such as visual inputs. This paper argues that MLLMs are deeply affected by modality bias, highlighting its manifestations across various tasks.
arXiv Detail & Related papers (2025-05-24T11:49:31Z)
- Transferable Adversarial Attacks on Black-Box Vision-Language Models [63.22532779621001]
Adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts. We show that attackers can craft perturbations to induce specific attacker-chosen interpretations of visual information. We discover that universal perturbations -- modifications applicable to a wide set of images -- can consistently induce these misinterpretations.
arXiv Detail & Related papers (2025-05-02T06:51:11Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.
LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.
We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z)
- Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation [15.883062174902093]
Large Vision-Language Models (VLMs) have demonstrated remarkable performance across multimodal tasks by integrating vision encoders with large language models (LLMs). We introduce a novel UAP specifically designed for VLMs: the Doubly-Universal Adversarial Perturbation (Doubly-UAP).
arXiv Detail & Related papers (2024-12-11T05:23:34Z)
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs [40.74693126923826]
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities.
Training adapters with image-level supervision often results in significant misalignment.
We introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models.
arXiv Detail & Related papers (2024-08-21T17:58:02Z)
- Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models [27.955342181784797]
There is currently no systematic research on the threat of cross-MLLM adversarial transferability.
We propose a boosting method called Typography Augment Transferability Method (TATM) to investigate the adversarial transferability performance across MLLMs.
arXiv Detail & Related papers (2024-05-30T14:27:20Z)
- Adversarial Robustness for Visual Grounding of Multimodal Large Language Models [49.71757071535619]
Multi-modal Large Language Models (MLLMs) have recently achieved enhanced performance across various vision-language tasks.
The adversarial robustness of visual grounding remains unexplored in MLLMs.
We propose three adversarial attack paradigms.
arXiv Detail & Related papers (2024-05-16T10:54:26Z)
- Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction [22.393624206051925]
Existing work rarely studies the transferability of attacks on Vision-Language Pre-training models.
We propose a novel attack, called Collaborative Multimodal Interaction Attack (CMI-Attack).
CMI-Attack raises the transfer success rates from ALBEF to TCL, $\text{CLIP}_{\text{ViT}}$ and $\text{CLIP}_{\text{CNN}}$ by 8.11%-16.75% over state-of-the-art methods.
arXiv Detail & Related papers (2024-03-16T10:32:24Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy, which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z)
- Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.