Related papers: Image Hijacks: Adversarial Images can Control Generative Models at Runtime

URL: http://arxiv.org/abs/2309.00236v4
Date: Tue, 17 Sep 2024 19:56:09 GMT
Title: Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Authors: Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons
Abstract summary: We discover image hijacks, adversarial images that control the behaviour of vision-language models at inference time. We derive the Prompt Matching method, allowing us to train hijacks matching the behaviour of an arbitrary user-defined text prompt. We use Behaviour Matching to craft hijacks for four types of attack, forcing VLMs to generate outputs of the adversary's choice, leak information from their context window, override their safety training, and believe false statements.
Score: 8.603201325413192
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Are foundation models secure against malicious actors? In this work, we focus on the image input to a vision-language model (VLM). We discover image hijacks, adversarial images that control the behaviour of VLMs at inference time, and introduce the general Behaviour Matching algorithm for training image hijacks. From this, we derive the Prompt Matching method, allowing us to train hijacks matching the behaviour of an arbitrary user-defined text prompt (e.g. 'the Eiffel Tower is now located in Rome') using a generic, off-the-shelf dataset unrelated to our choice of prompt. We use Behaviour Matching to craft hijacks for four types of attack, forcing VLMs to generate outputs of the adversary's choice, leak information from their context window, override their safety training, and believe false statements. We study these attacks against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all attack types achieve a success rate of over 80%. Moreover, our attacks are automated and require only small image perturbations.
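
The abstract describes optimizing small image perturbations so that a model produces a target behaviour. The sketch below illustrates only that generic idea and is not the paper's Behaviour Matching algorithm: a tiny random linear softmax classifier stands in for the VLM, the "behaviour" is a single target class, and the perturbation is kept inside an assumed L-infinity budget using signed-gradient steps. All names (`craft_perturbation`, `target_loss`, `eps`, `step`) and all numeric values are hypothetical choices made for illustration.

```python
import numpy as np


def target_loss(W, image, target):
    """Cross-entropy of the toy linear model's output against the target class."""
    z = W @ image
    z = z - z.max()  # shift logits for numerical stability
    return float(np.log(np.exp(z).sum()) - z[target])


def craft_perturbation(W, image, target, eps=0.03, step=0.005, n_steps=40):
    """Signed-gradient descent on the target loss, projected onto an L-inf ball.

    Returns the perturbation with the lowest target loss seen so far
    (the zero perturbation counts as a candidate).
    """
    delta = np.zeros_like(image)
    best_delta, best_loss = delta.copy(), target_loss(W, image, target)
    for _ in range(n_steps):
        z = W @ (image + delta)
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        onehot = np.zeros_like(probs)
        onehot[target] = 1.0
        # Gradient of the cross-entropy with respect to the input pixels.
        grad = W.T @ (probs - onehot)
        delta = delta - step * np.sign(grad)
        # Project back into the perturbation budget and the valid pixel range.
        delta = np.clip(delta, -eps, eps)
        delta = np.clip(image + delta, 0.0, 1.0) - image
        loss = target_loss(W, image + delta, target)
        if loss < best_loss:
            best_delta, best_loss = delta.copy(), loss
    return best_delta


# Toy setup: 3 output classes, a 64-"pixel" image (both are assumptions).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 64))
image = rng.uniform(0.2, 0.8, size=64)
target = int(np.argmin(W @ image))  # aim for the class the clean image favours least

delta = craft_perturbation(W, image, target)
clean_loss = target_loss(W, image, target)
adv_loss = target_loss(W, image + delta, target)
print(f"target loss: clean={clean_loss:.3f}, perturbed={adv_loss:.3f}")
print(f"max |delta| = {np.abs(delta).max():.3f}")
```

In a real setting the loss would be computed on the outputs of an actual vision-language model over a dataset of contexts, and the gradient would come from automatic differentiation rather than a closed form.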

Related papers

VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models [15.158545794377169]
We frame the preservation of privacy in Vision-Language Models as an adversarial attack problem. We propose a novel attack strategy that selectively conceals information within designated Regions of Interest in an image. Experimental results across three state-of-the-art VLMs demonstrate up to 98% reduction in detecting targeted ROIs.
arXiv Detail & Related papers (2025-07-11T19:34:01Z)
On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling [24.730395152276927]
A text-to-image generative model is trained on millions of images sourced from the Internet, each paired with a detailed caption produced by Vision-Language Models (VLMs). VLMs are vulnerable to stealthy adversarial attacks, where perturbations are added to images to mislead the VLMs into producing incorrect captions. We find that while potential defenses can be effective, they can be targeted and circumvented by adaptive attackers.
arXiv Detail & Related papers (2025-06-27T03:13:47Z)
Image Corruption-Inspired Membership Inference Attacks against Large Vision-Language Models [27.04420374256226]
Large vision-language models (LVLMs) have demonstrated outstanding performance in many downstream tasks. It is important to detect whether an image is used to train the LVLM. Recent studies have investigated membership inference attacks (MIAs) against LVLMs.
arXiv Detail & Related papers (2025-06-14T04:22:36Z)
Typographic Attacks in a Multi-Image Setting [2.9154316123656927]
We introduce a multi-image setting for studying typographic attacks. Specifically, our focus is on attacking image sets without repeating the attack query. We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity.
arXiv Detail & Related papers (2025-02-12T08:10:25Z)
'Do as I say not as I do': A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs [6.151779089440453]
We introduce the first voice-based jailbreak attack against multimodal large language models (LLMs). We propose a novel strategy, in which the disallowed prompt is flanked by benign, narrative-driven prompts. We demonstrate that Flanking Attack is capable of manipulating state-of-the-art LLMs into generating misaligned and forbidden outputs.
arXiv Detail & Related papers (2025-02-02T10:05:08Z)
AnyAttack: Towards Large-scale Self-supervised Generation of Targeted Adversarial Examples for Vision-Language Models [41.044385916368455]
Vision-Language Models (VLMs) are vulnerable to image-based adversarial attacks. We propose AnyAttack, a self-supervised framework that generates targeted adversarial images for VLMs without label supervision.
arXiv Detail & Related papers (2024-10-07T09:45:18Z)
Vera Verto: Multimodal Hijacking Attack [22.69532868255637]
A recent attack in this domain is the model hijacking attack, whereby an adversary hijacks a victim model to implement their own hijacking tasks. We transform the model hijacking attack into a more general multimodal setting, where the hijacking and original tasks are performed on data of different modalities. Our attack achieves 94%, 94%, and 95% attack success rate when using the Sogou news dataset to hijack STL10, CIFAR-10, and MNIST.
arXiv Detail & Related papers (2024-07-31T19:37:06Z)
Adversarial Attacks on Multimodal Agents [73.97379283655127]
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. We show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input. An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models [58.21452697997078]
We propose a novel VQAttack model, which can generate both image and text perturbations with the designed modules. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack.
arXiv Detail & Related papers (2024-02-16T21:17:42Z)
Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [62.34019142949628]
Typographic Attacks, which involve pasting misleading text onto an image, were noted to harm the performance of Vision-Language Models like CLIP. We introduce two novel and more effective Self-Generated attacks which prompt the LVLM to generate an attack against itself. Using our benchmark, we uncover that Self-Generated attacks pose a significant threat, reducing LVLM(s) classification performance by up to 33%.
arXiv Detail & Related papers (2024-02-01T14:41:20Z)
InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [13.21813503235793]
Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation. In this paper, we formulate a novel and practical targeted attack scenario in which the adversary can only know the vision encoder of the victim LVLM. We propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver the targeted adversarial attack on LVLMs with high transferability.
arXiv Detail & Related papers (2023-12-04T13:40:05Z)
Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
Fooling Contrastive Language-Image Pre-trained Models with CLIPMasterPrints [15.643898659673036]
We show that despite their versatility, CLIP models are vulnerable to what we refer to as fooling master images. Fooling master images are capable of maximizing the confidence score of a CLIP model for a significant number of widely varying prompts. We demonstrate how fooling master images for CLIPMasterPrints can be mined using gradient descent, projected descent, or blackbox optimization.
arXiv Detail & Related papers (2023-07-07T18:54:11Z)
Dual Manifold Adversarial Robustness: Defense against Lp and non-Lp Adversarial Attacks [154.31827097264264]
Adversarial training is a popular defense strategy against attack threat models with bounded Lp norms. We propose Dual Manifold Adversarial Training (DMAT) where adversarial perturbations in both latent and image spaces are used in robustifying the model. Our DMAT improves performance on normal images, and achieves comparable robustness to the standard adversarial training against Lp attacks.
arXiv Detail & Related papers (2020-09-05T06:00:28Z)
