When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
- URL: http://arxiv.org/abs/2407.15211v1
- Date: Sun, 21 Jul 2024 16:27:24 GMT
- Title: When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
- Authors: Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
- Abstract summary: We focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs.
We conduct a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks".
We find that transferable gradient-based image jailbreaks are extremely difficult to obtain.
- Score: 20.385314634225978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of "highly-similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
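At its core, the attack studied here is a gradient search for a single image that maximizes the likelihood of affirmative responses across one or more white-box VLMs. Below is a minimal sketch of the ensemble variant, assuming each model exposes a hypothetical differentiable `log_prob(image, prompt, target)` interface (real VLMs additionally require tokenization and image preprocessing):

```python
import torch

def ensemble_image_jailbreak(vlms, prompts, target, steps=500, lr=1e-2):
    """Gradient search for a universal jailbreak image.

    vlms    : white-box VLM wrappers, each with a differentiable
              .log_prob(image, prompt, target) -> scalar (hypothetical API)
    prompts : harmful instructions the image should "unlock"
    target  : affirmative response prefix, e.g. "Sure, here is how to ..."
    """
    image = torch.rand(3, 336, 336, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Minimize negative target log-likelihood, averaged over the ensemble
        loss = -sum(m.log_prob(image, p, target)
                    for m in vlms for p in prompts) / (len(vlms) * len(prompts))
        loss.backward()
        opt.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)  # keep pixels in the valid range
    return image.detach()
```

The paper's headline result: an image found this way reliably jailbreaks the VLMs inside the sum, but transfers to a held-out VLM only when that model is "highly similar" (e.g., a nearby training checkpoint) to the attacked ensemble.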
Related papers
- Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts [25.661444231400772]
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs).
These advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content.
We introduce Arondight, a standardized red team framework tailored specifically for VLMs.
arXiv Detail & Related papers (2024-07-21T04:37:11Z)
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
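The textual half of this attack is an LLM-in-the-loop refinement. A hedged sketch, where `target_vlm`, `is_refusal`, and `rewriter` are hypothetical callables standing in for the attacked model, a refusal judge, and the helper LLM (none of these names come from the paper):

```python
def refine_text_prompt(target_vlm, is_refusal, rewriter, image, prompt, rounds=5):
    """Iteratively rewrite the textual prompt until the VLM complies.

    target_vlm(image, text) -> str : response of the attacked model
    is_refusal(text) -> bool       : True if the response is a refusal
    rewriter(text) -> str          : helper LLM that reasons step by step
                                     about the failure and emits a new prompt
    """
    for _ in range(rounds):
        response = target_vlm(image, prompt)
        if not is_refusal(response):  # model complied; attack succeeded
            return prompt, response
        # Chain-of-thought failure analysis drives the next rewrite
        prompt = rewriter(
            "The prompt below was refused. Reason step by step about why, "
            f"then output an improved prompt.\nPrompt: {prompt}\nRefusal: {response}"
        )
    return prompt, None
```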
- Efficient LLM-Jailbreaking by Introducing Visual Modality [28.925716670778076]
This paper focuses on jailbreaking attacks against large language models (LLMs).
Our approach begins by constructing a multimodal large language model (MLLM) through the incorporation of a visual module into the target LLM.
We then convert the optimized jailbreaking embeddings (embJS) into text space to facilitate the jailbreaking of the target LLM.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
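The "convert into text space" step above amounts to projecting each optimized soft embedding onto a nearby token in the LLM's vocabulary. A minimal sketch, assuming `emb_js` holds the optimized embeddings and `emb_matrix` is the target LLM's input-embedding table (nearest-neighbor projection by cosine similarity is an assumption; the paper may use a different conversion):

```python
import torch
import torch.nn.functional as F

def embeddings_to_tokens(emb_js: torch.Tensor, emb_matrix: torch.Tensor) -> list:
    """Project jailbreak embeddings (embJS) onto discrete tokens.

    emb_js     : (seq_len, dim) optimized soft-prompt embeddings
    emb_matrix : (vocab_size, dim) the LLM's input embedding table
    Returns token ids whose embeddings are most similar; detokenizing
    them yields a textual jailbreak usable against the plain LLM.
    """
    sims = F.normalize(emb_js, dim=-1) @ F.normalize(emb_matrix, dim=-1).T
    return sims.argmax(dim=-1).tolist()  # nearest vocabulary token per position
```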
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerabilities within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks [41.213482317141356]
Augmenting Large Language Models with image-understanding capabilities has resulted in a boom of high-performing Vision-Language Models (VLMs).
In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach.
arXiv Detail & Related papers (2024-05-07T15:29:48Z)
- Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs).
A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP).
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
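The method behind this result (GCG) is greedy coordinate descent over suffix tokens: gradients taken at the one-hot token layer shortlist promising substitutions, and exact forward passes pick the best one. A compressed sketch of a single update, with `grad_wrt_onehot` and `loss_fn` as hypothetical helpers:

```python
import torch

def gcg_step(suffix_ids, grad_wrt_onehot, loss_fn, k=256, n_candidates=64):
    """One greedy-coordinate-gradient update on an adversarial suffix.

    suffix_ids      : (L,) current suffix token ids
    grad_wrt_onehot : (L, vocab) loss gradient w.r.t. one-hot indicators
    loss_fn(ids)    : exact loss of a candidate suffix (one forward pass)
    """
    top_k = (-grad_wrt_onehot).topk(k, dim=-1).indices  # promising swaps per position
    best_ids, best_loss = suffix_ids, loss_fn(suffix_ids)
    for _ in range(n_candidates):
        cand = suffix_ids.clone()
        pos = torch.randint(len(suffix_ids), (1,)).item()  # random position
        cand[pos] = top_k[pos, torch.randint(k, (1,)).item()]
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:  # keep only the best substitution
            best_ids, best_loss = cand, cand_loss
    return best_ids
```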
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z)
- On Evaluating Adversarial Robustness of Large Vision-Language Models [64.66104342002882]
We evaluate the robustness of large vision-language models (VLMs) in the most realistic and high-risk setting.
In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP.
Black-box queries on these VLMs can further improve the effectiveness of targeted evasion.
arXiv Detail & Related papers (2023-05-26T13:49:44Z)
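Such targeted examples can be crafted by pushing a source image's embedding toward that of a chosen target. A minimal L-infinity sketch, where `encode` is a hypothetical differentiable wrapper around an image encoder such as CLIP's (preprocessing omitted):

```python
import torch
import torch.nn.functional as F

def targeted_evasion(encode, src_image, tgt_image, eps=8/255, steps=100, lr=1/255):
    """Make src_image embed like tgt_image under an L-inf budget eps."""
    with torch.no_grad():
        tgt_emb = encode(tgt_image).flatten()
    delta = torch.zeros_like(src_image, requires_grad=True)
    for _ in range(steps):
        emb = encode(src_image + delta).flatten()
        loss = -F.cosine_similarity(emb, tgt_emb, dim=0)  # maximize similarity
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()  # signed-gradient (PGD) step
            delta.clamp_(-eps, eps)          # stay inside the L-inf ball
            delta.copy_((src_image + delta).clamp(0, 1) - src_image)  # valid pixels
            delta.grad.zero_()
    return (src_image + delta).detach()
```

Black-box transfer then hinges on how closely the surrogate encoder matches the victim's, which is precisely the similarity axis the main paper probes for full VLMs.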