Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
- URL: http://arxiv.org/abs/2603.04839v1
- Date: Thu, 05 Mar 2026 05:46:16 GMT
- Title: Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
- Authors: Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu, Josef Kittler,
- Abstract summary: We propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbations.<n>SADCA establishes a contrastive learning mechanism involving adversarial, positive and negative samples, to reinforce the semantic inconsistency of the obtained perturbations.<n>Experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods.
- Score: 67.45032003041399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, the adversarial examples can typically be designed to exhibit transferable power, attacking not only different models but also across diverse tasks. However, existing attacks on language-vision models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. This is accomplished by SADCA establishing a contrastive learning mechanism involving adversarial, positive and negative samples, to reinforce the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
Related papers
- When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings.<n>We develop the first automatically crafted and semantically guided prompting framework.<n> Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z) - Boosting the Local Invariance for Better Adversarial Transferability [4.75067406339309]
Transfer-based attacks pose a significant threat to real-world applications.<n>We propose a general adversarial transferability boosting technique called Local Invariance Boosting approach (LI-Boost)<n>Experiments on the standard ImageNet dataset demonstrate that LI-Boost could significantly boost various types of transfer-based attacks.
arXiv Detail & Related papers (2025-03-08T09:44:45Z) - Boosting Adversarial Transferability with Spatial Adversarial Alignment [56.97809949196889]
Deep neural networks are vulnerable to adversarial examples that exhibit transferability across various models.<n>We propose a technique that employs an alignment loss and leverages a witness model to fine-tune the surrogate model.<n>Experiments on various architectures on ImageNet show that aligned surrogate models based on SAA can provide higher transferable adversarial examples.
arXiv Detail & Related papers (2025-01-02T02:35:47Z) - Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack [51.16384207202798]
Vision-language pre-training models are vulnerable to multimodal adversarial examples (AEs)
Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process.
We propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity.
arXiv Detail & Related papers (2024-11-04T23:07:51Z) - Efficient Generation of Targeted and Transferable Adversarial Examples for Vision-Language Models Via Diffusion Models [17.958154849014576]
Adversarial attacks can be used to assess the robustness of large visual-language models (VLMs)<n>Previous transfer-based adversarial attacks incur high costs due to high iteration counts and complex method structure.<n>We propose AdvDiffVLM, which uses diffusion models to generate natural, unrestricted and targeted adversarial examples.
arXiv Detail & Related papers (2024-04-16T07:19:52Z) - Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction [22.393624206051925]
Existing work rarely studies the transferability of attacks on Vision-Language Pre-training models.
We propose a novel attack, called Collaborative Multimodal Interaction Attack (CMI-Attack)
CMI-Attack raises the transfer success rates from ALBEF to TCL, $textCLIP_textViT$ and $textCLIP_textCNN$ by 8.11%-16.75% over state-of-the-art methods.
arXiv Detail & Related papers (2024-03-16T10:32:24Z) - SA-Attack: Improving Adversarial Transferability of Vision-Language
Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z) - Set-level Guidance Attack: Boosting Adversarial Transferability of
Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.