One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
- URL: http://arxiv.org/abs/2406.05491v1
- Date: Sat, 8 Jun 2024 15:01:54 GMT
- Title: One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
- Authors: Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Shutao Xia, Ke Xu
- Abstract summary: Vision-Language Pre-training models trained on large-scale image-text pairs are vulnerable to adversarial samples crafted by a malicious adversary.
We propose a Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC) to exploit this vulnerability.
- Score: 47.14654793461
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Vision-Language Pre-training (VLP) models trained on large-scale image-text pairs have demonstrated unprecedented capability in many practical applications. However, previous studies have revealed that VLP models are vulnerable to adversarial samples crafted by a malicious adversary. While existing attacks have achieved great success in improving attack effectiveness and transferability, they all focus on instance-specific attacks that generate perturbations for each input sample. In this paper, we show that VLP models can be vulnerable to a new class of universal adversarial perturbations (UAPs) that apply to all input samples. Although existing UAP algorithms are effective at attacking discriminative models, directly transplanting them to VLP models yields unsatisfactory results. To this end, we revisit the multimodal alignments in VLP model training and propose the Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC). Specifically, we first design a generator that incorporates cross-modal information as conditioning input to guide the training. To further exploit cross-modal interactions, we formulate the training objective as a multimodal contrastive learning paradigm based on our constructed positive and negative image-text pairs. By training the conditional generator with the designed loss, we force adversarial samples to move away from their original regions in the VLP model's feature space, substantially enhancing the attack. Extensive experiments show that our method achieves remarkable attack performance across various VLP models and Vision-and-Language (V+L) tasks. Moreover, C-PGC exhibits outstanding black-box transferability and achieves impressive results in fooling prevalent large VLP models, including LLaVA and Qwen-VL.
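The abstract names two ingredients: a perturbation generator conditioned on cross-modal information, and a contrastive objective that drives adversarial features out of their originally aligned regions. Below is a minimal PyTorch-style sketch of what such a training objective could look like; the generator architecture, feature dimensions, `image_encoder`, and all hyperparameters are hypothetical placeholders, and the paper's constructed positive/negative pairs are simplified here to in-batch pairs, so this is an illustration of the idea rather than the authors' C-PGC implementation.

```python
# Hypothetical sketch of a contrastive, cross-modal-conditioned UAP objective.
# All names, shapes, and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from math import prod

class CondUAPGenerator(nn.Module):
    """Maps a pooled text condition to a single universal perturbation."""
    def __init__(self, txt_dim=256, img_shape=(3, 224, 224)):
        super().__init__()
        self.img_shape = img_shape
        self.fc = nn.Linear(txt_dim, prod(img_shape))

    def forward(self, text_feats):                      # (B, txt_dim)
        cond = text_feats.mean(dim=0)                   # pool the batch: one UAP for all inputs
        return self.fc(cond).view(1, *self.img_shape)   # (1, C, H, W)

def contrastive_uap_loss(gen, image_encoder, images, text_feats, eps=8/255, tau=0.1):
    """Push each adversarial image away from its matched caption and toward
    mismatched captions in the aligned feature space (negated InfoNCE)."""
    delta = gen(text_feats).tanh() * eps                # L_inf-bounded universal noise
    adv = F.normalize(image_encoder((images + delta).clamp(0, 1)), dim=-1)
    txt = F.normalize(text_feats, dim=-1)               # assumes matching feature dims
    logits = adv @ txt.t() / tau                        # (B, B) image-text similarities
    labels = torch.arange(logits.size(0))
    # Negating the InfoNCE loss drives adversarial features out of their
    # original aligned regions instead of pulling matched pairs together.
    return -F.cross_entropy(logits, labels)
```

A training loop would backpropagate this loss into `gen` only, keeping the frozen VLP encoders as fixed feature extractors.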
Related papers
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models [46.14455492739906]
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks.
Existing approaches mainly focus on exploring adversarial robustness under the white-box setting.
We propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels.
arXiv Detail & Related papers (2023-10-07T02:18:52Z)
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z)
- Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
arXiv Detail & Related papers (2023-03-13T17:41:57Z)
- Towards Adversarial Attack on Vision-Language Pre-training Models [15.882687207499373]
This paper studies adversarial attacks on popular vision-language (V+L) models and V+L tasks.
By examining the influence of different objects and attack targets, we draw key observations that serve as guidance for designing strong multimodal adversarial attacks.
arXiv Detail & Related papers (2022-06-19T12:55:45Z)
- A Prompting-based Approach for Adversarial Example Generation and Robustness Enhancement [18.532308729844598]
We propose a novel prompt-based adversarial attack to compromise NLP models.
We generate adversarial examples via mask-and-filling guided by a malicious purpose (a minimal sketch of the mask-and-fill step follows this entry).
Since our training method does not actually generate adversarial samples, it can be applied to large-scale training sets efficiently.
arXiv Detail & Related papers (2022-03-21T03:21:32Z)
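The mask-and-fill step can be illustrated with a generic masked language model. The sketch below uses the HuggingFace fill-mask pipeline and a standard BERT checkpoint; it omits the malicious-purpose conditioning and the victim-model selection step, both specific to the paper, so treat it as a generic illustration under those assumptions.

```python
# Generic mask-and-fill candidate generation; the selection of a candidate
# that actually fools a victim model is omitted and would be method-specific.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def mask_and_fill_candidates(words, position, top_k=5):
    """Mask one word and let a masked LM propose in-context replacements;
    an adversarial objective would then pick the candidate that flips the
    victim model's prediction."""
    masked = words.copy()
    masked[position] = fill.tokenizer.mask_token
    outputs = fill(" ".join(masked), top_k=top_k)
    return [out["token_str"].strip() for out in outputs]

# Example: propose substitutions for the word "good" in a review sentence.
print(mask_and_fill_candidates("the movie was good".split(), 3))
```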
- Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training models that are robust to substitution-based attacks.
DNE forms virtual sentences by sampling an embedding vector for each word in an input sentence from the convex hull spanned by the word and its synonyms, and augments the training data with them (see the sketch after this entry).
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
arXiv Detail & Related papers (2020-06-20T18:01:16Z)
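The convex-hull sampling that DNE describes can be written down directly: Dirichlet weights are non-negative and sum to one, so a Dirichlet-weighted combination of a word's embedding and its synonyms' embeddings always lies inside their convex hull. A minimal sketch, assuming a hypothetical embedding matrix and synonym table (not the paper's released code):

```python
# Minimal sketch of DNE-style convex-hull sampling with Dirichlet weights.
import torch

def dirichlet_neighborhood_sample(emb, token_ids, synonyms, alpha=1.0):
    """For each token, return a random point inside the convex hull spanned
    by the token's embedding and its synonyms' embeddings."""
    virtual = []
    for tid in token_ids:
        hull_ids = [tid] + synonyms.get(tid, [])        # word plus its synonyms
        vectors = emb[torch.tensor(hull_ids)]           # (k, d) hull vertices
        # Dirichlet samples are non-negative and sum to 1, so the weighted
        # sum is guaranteed to lie inside the convex hull of the vertices.
        w = torch.distributions.Dirichlet(
            torch.full((len(hull_ids),), alpha)).sample()
        virtual.append(w @ vectors)
    return torch.stack(virtual)                         # (seq_len, d) virtual sentence

# Example: 100-word vocabulary, 16-dim embeddings; token 3 has synonyms 7 and 42.
emb = torch.randn(100, 16)
virtual_sentence = dirichlet_neighborhood_sample(emb, [3, 5], {3: [7, 42]})
```

Training on such virtual sentences exposes the model to the full neighborhood of each word rather than individual discrete substitutions, which is what yields the smoothing effect.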