A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models
- URL: http://arxiv.org/abs/2601.12304v1
- Date: Sun, 18 Jan 2026 08:05:33 GMT
- Title: A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models
- Authors: Wutao Chen, Huaqin Zou, Chen Wan, Lifeng Huang
- Abstract summary: Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. We propose 2S-GDA, a two-stage globally-diverse attack framework. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.
- Score: 3.9965186683223606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training (VLP) models are vulnerable to adversarial examples, particularly in black-box scenarios. Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. To address these challenges, we propose 2S-GDA, a two-stage globally-diverse attack framework. The proposed method first introduces textual perturbations through a globally-diverse strategy by combining candidate text expansion with globally-aware replacement. To enhance visual diversity, image-level perturbations are generated using multi-scale resizing and block-shuffle rotation. Extensive experiments on VLP models demonstrate that 2S-GDA consistently improves attack success rates over state-of-the-art methods, with gains of up to 11.17% in black-box settings. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.
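The abstract names two image-side diversity transforms, multi-scale resizing and block-shuffle rotation, but does not specify their parameters. The following is a minimal sketch of how such transforms are commonly implemented; the scale set, the 2x2 block grid, and the 90-degree rotations are illustrative assumptions, not the authors' settings, and a square input (e.g. 224x224) is assumed so rotated blocks fit back in place.

```python
# Minimal sketch of the image-side diversity transforms named in the abstract.
# The scale set, 2x2 grid, and 90-degree rotations are illustrative assumptions;
# the paper does not specify these values. Assumes a square image (e.g. 224x224)
# so that rotated blocks fit back into place.
import random
import numpy as np

def multi_scale_resize(img: np.ndarray, scales=(0.75, 0.9, 1.1)) -> np.ndarray:
    """Nearest-neighbour resize by a random scale, then crop/zero-pad back."""
    h, w = img.shape[:2]
    s = random.choice(scales)
    nh, nw = max(1, int(h * s)), max(1, int(w * s))
    rows = (np.arange(nh) * h / nh).astype(int)
    cols = (np.arange(nw) * w / nw).astype(int)
    scaled = img[rows[:, None], cols]
    out = np.zeros_like(img)
    ch, cw = min(h, nh), min(w, nw)
    out[:ch, :cw] = scaled[:ch, :cw]
    return out

def block_shuffle_rotate(img: np.ndarray, grid: int = 2) -> np.ndarray:
    """Split into grid x grid blocks, shuffle them, rotate each by k*90 degrees."""
    h, w = img.shape[:2]
    bh, bw = h // grid, w // grid
    blocks = [img[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].copy()
              for i in range(grid) for j in range(grid)]
    random.shuffle(blocks)
    out = img.copy()
    for idx, blk in enumerate(blocks):
        i, j = divmod(idx, grid)
        out[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw] = np.rot90(blk, k=random.randint(0, 3))
    return out

# Usage: apply a fresh random transform before each gradient step.
img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
augmented = block_shuffle_rotate(multi_scale_resize(img))
```

Applying a fresh random transform before each gradient step makes the resulting perturbation less specific to any single view of the image, which is the usual rationale for such input-diversity schemes.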
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z)
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z)
- Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models [19.899086203883254]
Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks. Their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely on the vision encoder, rather than the entire LVLM. We present the first systematic study of encoder-based adversarial transferability in LVLMs.
arXiv Detail & Related papers (2026-02-10T05:51:02Z)
- Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models [41.79238283279954]
HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures.
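The abstract only names the two importance measures; as a hedged illustration, the toy ranking below combines an intra-sentence ablation score with an inter-sentence document-frequency weight. The `sim_fn` scorer is a hypothetical placeholder (e.g. image-text similarity from a surrogate model), not HRA's actual measure.

```python
# Hedged sketch: rank "globally influential" words by combining an
# intra-sentence ablation score with an inter-sentence document frequency.
# `sim_fn` is a hypothetical scorer (e.g. surrogate image-text similarity);
# HRA's actual importance measures are not specified in the abstract.
from collections import Counter

def rank_influential_words(sentences, sim_fn):
    doc_freq = Counter(w for s in sentences for w in set(s.split()))
    scores = {}
    for s in sentences:
        words = s.split()
        base = sim_fn(s)
        for i, w in enumerate(words):
            ablated = " ".join(words[:i] + words[i + 1:])
            intra = abs(base - sim_fn(ablated))                   # intra-sentence effect
            scores[w] = scores.get(w, 0.0) + intra * doc_freq[w]  # inter-sentence weight
    return sorted(scores, key=scores.get, reverse=True)

# Usage with a trivial stand-in scorer (sentence length as a dummy signal):
demo = ["a dog runs on grass", "a dog sleeps"]
print(rank_influential_words(demo, sim_fn=lambda s: len(s.split())))
```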
arXiv Detail & Related papers (2026-01-15T11:45:56Z)
- When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. We develop the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z)
- Enhancing Adversarial Transferability in Visual-Language Pre-training Models via Local Shuffle and Sample-based Attack [6.190046662134303]
We propose a novel attack called Local Shuffle and Sample-based Attack (LSSA). LSSA randomly shuffles one of the local image blocks, thus expanding the original image-text pairs, generating adversarial images, and sampling around them. Experiments on multiple models and datasets demonstrate that LSSA significantly enhances the transferability of multimodal adversarial examples.
arXiv Detail & Related papers (2025-11-02T06:55:49Z)
- Universal Camouflage Attack on Vision-Language Models for Autonomous Driving [67.34987318443761]
Visual language modeling for automated driving is emerging as a promising research direction. VLM-AD remains vulnerable to serious security threats from adversarial attacks. We propose the first Universal Camouflage Attack framework for VLM-AD.
arXiv Detail & Related papers (2025-09-24T14:52:01Z)
- Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models [26.656858396343726]
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data. We explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data.
arXiv Detail & Related papers (2025-02-03T17:59:45Z)
- Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS). We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates a local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
- Cross-Modality Attack Boosted by Gradient-Evolutionary Multiform Optimization [4.226449585713182]
Cross-modal adversarial attacks face significant challenges in attack transferability.
We propose a novel cross-modal adversarial attack strategy, termed Multiform Attack.
We demonstrate the superiority and robustness of Multiform Attack compared to existing techniques.
arXiv Detail & Related papers (2024-09-26T15:52:34Z)
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance; a hedged sketch of the set-level idea appears after this list.
arXiv Detail & Related papers (2023-07-26T09:19:21Z)
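To make the set-level guidance idea concrete, below is a hedged PGD-style sketch that pushes an image embedding away from a set of paired caption embeddings across several resized copies. The `image_encoder` and `caption_embs` names, the scale set, and the budget values are illustrative assumptions, not SGA's published implementation.

```python
# Hedged PGD-style sketch of a set-level, augmentation-aware attack.
# `image_encoder` and `caption_embs` are assumed placeholders (e.g. a CLIP-like
# surrogate and its L2-normalized embeddings of the paired caption set); the
# scale set and epsilon/step values are illustrative, not SGA's settings.
# `image` is a (1, 3, H, W) float tensor in [0, 1].
import torch
import torch.nn.functional as F

def set_level_attack(image, caption_embs, image_encoder,
                     scales=(0.9, 1.0, 1.1), eps=8 / 255, alpha=2 / 255, steps=10):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = 0.0
        for s in scales:  # alignment-preserving resized copies of the input
            x = F.interpolate(image + delta, scale_factor=s,
                              mode="bilinear", align_corners=False)
            x = F.interpolate(x, size=image.shape[-2:],
                              mode="bilinear", align_corners=False)
            emb = F.normalize(image_encoder(x), dim=-1)
            loss = loss + (emb @ caption_embs.t()).mean()  # similarity to the whole set
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # untargeted: reduce set similarity
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```

In practice `caption_embs` would come from the surrogate's text encoder over the matched caption set, and flipping the sign of the update step yields a targeted variant.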