Adversarial Attacks on Multimodal Agents
- URL: http://arxiv.org/abs/2406.12814v1
- Date: Tue, 18 Jun 2024 17:32:48 GMT
- Title: Adversarial Attacks on Multimodal Agents
- Authors: Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
- Abstract summary: Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments.
We show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment.
- Score: 73.97379283655127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of $16/256$ on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses as well. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack
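For a concrete picture of the CLIP attack described above, the sketch below runs an L-infinity-bounded PGD loop that nudges a single trigger image toward an adversarial text string under an ensemble of open CLIP models. This is only a minimal illustration written against the open_clip library, not the authors' released code (see the repository linked above); the checkpoint names, goal string, step size, and iteration count are assumptions made for the example.
```python
# Minimal sketch of an ensemble CLIP attack: L_inf-bounded PGD that pushes one
# trigger image toward an adversarial goal string. NOT the authors' code
# (see https://github.com/ChenWu98/agent-attack); model choices and
# hyperparameters below are illustrative assumptions.
import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed ensemble of open CLIP checkpoints to attack jointly.
model_specs = [("ViT-B-32", "laion2b_s34b_b79k"), ("ViT-L-14", "laion2b_s32b_b82k")]
models, tokenizers = [], []
for arch, ckpt in model_specs:
    model, _, _ = open_clip.create_model_and_transforms(arch, pretrained=ckpt)
    models.append(model.eval().requires_grad_(False).to(device))
    tokenizers.append(open_clip.get_tokenizer(arch))

adv_text = "This is the item the user is looking for."  # hypothetical adversarial goal string
eps, alpha, steps = 16 / 256, 2 / 256, 200               # L_inf budget of 16/256 as in the paper


def clip_attack(image: torch.Tensor) -> torch.Tensor:
    """image: pixel tensor in [0, 1] on `device`, shape (1, 3, 224, 224).
    Per-model CLIP normalization is omitted here for brevity."""
    # Pre-compute the fixed text embedding of the adversarial goal for each model.
    text_feats = []
    for model, tok in zip(models, tokenizers):
        with torch.no_grad():
            f = model.encode_text(tok([adv_text]).to(device))
            text_feats.append(f / f.norm(dim=-1, keepdim=True))

    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Sum of cosine similarities between the perturbed image and the goal text.
        loss = torch.zeros((), device=device)
        for model, tf in zip(models, text_feats):
            img_f = model.encode_image(image + delta)
            img_f = img_f / img_f.norm(dim=-1, keepdim=True)
            loss = loss + (img_f * tf).sum()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                 # gradient-ascent (PGD) step
            delta.clamp_(-eps, eps)                            # project into the L_inf ball
            delta.copy_((image + delta).clamp(0, 1) - image)   # keep pixels in [0, 1]
            delta.grad.zero_()
    return (image + delta).detach()
```
The perturbed image would then be planted as the trigger image in the environment; per the abstract, such joint perturbations over several CLIP models can transfer to agents built on proprietary VLMs such as GPT-4V.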
Related papers
- Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues [88.96201324719205]
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions.
We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory.
arXiv Detail & Related papers (2024-10-14T16:41:49Z)
- AnyAttack: Towards Large-scale Self-supervised Generation of Targeted Adversarial Examples for Vision-Language Models [41.044385916368455]
Vision-Language Models (VLMs) are vulnerable to image-based adversarial attacks.
We propose AnyAttack, a self-supervised framework that generates targeted adversarial images for VLMs without label supervision.
arXiv Detail & Related papers (2024-10-07T09:45:18Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerabilities within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z)
- Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models [73.37389786808174]
This study takes the first step in exposing Vision-Language Models' susceptibility to data poisoning attacks.
We introduce Shadowcast, a stealthy data poisoning attack where poison samples are visually indistinguishable from benign images.
We show that Shadowcast effectively achieves the attacker's intentions using as few as 50 poison samples.
arXiv Detail & Related papers (2024-02-05T18:55:53Z)
- Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [62.34019142949628]
Typographic Attacks, which involve pasting misleading text onto an image, were noted to harm the performance of Vision-Language Models like CLIP.
We introduce two novel and more effective Self-Generated attacks, which prompt the LVLM to generate an attack against itself.
Using our benchmark, we uncover that Self-Generated attacks pose a significant threat, reducing LVLMs' classification performance by up to 33%.
arXiv Detail & Related papers (2024-02-01T14:41:20Z)
- InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [13.21813503235793]
Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation.
In this paper, we formulate a novel and practical targeted attack scenario in which the adversary knows only the vision encoder of the victim LVLM.
We propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver targeted adversarial attacks on LVLMs with high transferability.
arXiv Detail & Related papers (2023-12-04T13:40:05Z)
- How Robust is Google's Bard to Adversarial Image Attacks? [45.92999116520135]
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks.
However, because the adversarial robustness of vision models remains unsolved, MLLMs can face more severe safety and security risks.
We study the adversarial robustness of Google's Bard to better understand the vulnerabilities of commercial MLLMs.
arXiv Detail & Related papers (2023-09-21T03:24:30Z)
- Image Hijacks: Adversarial Images can Control Generative Models at Runtime [8.603201325413192]
We discover image hijacks, adversarial images that control the behaviour of vision-language models at inference time.
We derive the Prompt Matching method, allowing us to train hijacks matching the behaviour of an arbitrary user-defined text prompt.
We use Behaviour Matching to craft hijacks for four types of attack, forcing VLMs to generate outputs of the adversary's choice, leak information from their context window, override their safety training, and believe false statements.
arXiv Detail & Related papers (2023-09-01T03:53:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.