VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via
Pre-trained Models
- URL: http://arxiv.org/abs/2310.04655v3
- Date: Mon, 5 Feb 2024 19:33:53 GMT
- Title: VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via
Pre-trained Models
- Authors: Ziyi Yin, Muchao Ye, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu,
Jinghui Chen, Ting Wang, Fenglong Ma
- Abstract summary: Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks.
Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting.
We propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels.
- Score: 46.14455492739906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language (VL) pre-trained models have shown their superiority on many
multimodal tasks. However, the adversarial robustness of such models has not
been fully explored. Existing approaches mainly focus on exploring the
adversarial robustness under the white-box setting, which is unrealistic. In
this paper, we aim to investigate a new yet practical task to craft image and
text perturbations using pre-trained VL models to attack black-box fine-tuned
models on different downstream tasks. Towards this end, we propose VLATTACK to
generate adversarial samples by fusing perturbations of images and texts from
both single-modal and multimodal levels. At the single-modal level, we propose
a new block-wise similarity attack (BSA) strategy to learn image perturbations
for disrupting universal representations. Besides, we adopt an existing text
attack strategy to generate text perturbations independent of the image-modal
attack. At the multimodal level, we design a novel iterative cross-search
attack (ICSA) method to update adversarial image-text pairs periodically,
starting with the outputs from the single-modal level. We conduct extensive
experiments to attack five widely-used VL pre-trained models for six tasks.
Experimental results show that VLATTACK achieves the highest attack success
rates on all tasks compared with state-of-the-art baselines, which reveals a
blind spot in the deployment of pre-trained VL models. Source codes can be
found at https://github.com/ericyinyzy/VLAttack.
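The abstract describes the block-wise similarity attack (BSA) only at a high level. As a rough illustration, below is a minimal PyTorch sketch of a block-wise similarity image attack in that spirit: perturb the image so that the intermediate block features of a surrogate image encoder drift away from those of the clean input. The encoder interface (returning a list of per-block feature maps), the plain PGD loop, and the hyperparameters are illustrative assumptions rather than the paper's exact formulation; the official repository linked above contains the actual implementation.

```python
import torch
import torch.nn.functional as F

def block_wise_similarity_attack(encoder, image, eps=8/255, alpha=2/255, steps=10):
    """Sketch of a BSA-style image attack (illustrative, not VLATTACK's exact method).

    Assumes `encoder(x)` returns a list of per-block feature maps for a surrogate
    VL image encoder. The attack minimizes the block-wise cosine similarity between
    clean and adversarial features under an L_inf budget `eps`.
    """
    with torch.no_grad():
        clean_feats = [f.detach() for f in encoder(image)]

    # Random start inside the L_inf ball.
    adv = image.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)

    for _ in range(steps):
        adv.requires_grad_(True)
        adv_feats = encoder(adv)
        # Sum of block-wise cosine similarities between clean and adversarial features.
        sim = sum(
            F.cosine_similarity(a.flatten(1), c.flatten(1), dim=1).mean()
            for a, c in zip(adv_feats, clean_feats)
        )
        grad = torch.autograd.grad(sim, adv)[0]
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                 # descend the similarity
            adv = image + (adv - image).clamp(-eps, eps)    # project back into the budget
            adv = adv.clamp(0, 1).detach()
    return adv
```

At the multimodal level, the abstract's ICSA step would then periodically alternate text substitutions with refreshed image perturbations of this kind, starting from the single-modal outputs; that orchestration is omitted from the sketch.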
Related papers
- AnyAttack: Towards Large-scale Self-supervised Generation of Targeted Adversarial Examples for Vision-Language Models [41.044385916368455]
Vision-Language Models (VLMs) are vulnerable to image-based adversarial attacks.
We propose AnyAttack, a self-supervised framework that generates targeted adversarial images for VLMs without label supervision.
arXiv Detail & Related papers (2024-10-07T09:45:18Z)
- A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models [7.350203999073509]
Feature Guidance Attack (FGA) is a novel method that uses text representations to direct the perturbation of clean images.
Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings.
arXiv Detail & Related papers (2024-07-25T06:10:33Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective [32.42201363966808]
We study adapting vision-language models for adversarial robustness under multimodal attacks.
We propose a multimodal contrastive adversarial training loss, aligning the clean and adversarial text embeddings with the adversarial and clean visual features.
Experiments on 15 datasets across two tasks demonstrate that our method significantly improves the adversarial robustness of CLIP.
arXiv Detail & Related papers (2024-04-30T06:34:21Z)
- VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z)
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules.
Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z)
- Towards Adversarial Attack on Vision-Language Pre-training Models [15.882687207499373]
This paper studies adversarial attacks on popular vision-language (V+L) models and V+L tasks.
By examining the influence of different attacked objects and attack targets, we distill several key observations that serve as guidance for designing strong multimodal adversarial attacks.
arXiv Detail & Related papers (2022-06-19T12:55:45Z)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest MoCo method to the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources.
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.