Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
- URL: http://arxiv.org/abs/2602.01025v1
- Date: Sun, 01 Feb 2026 05:18:47 GMT
- Title: Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models
- Authors: Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang,
- Abstract summary: Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text.<n> multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses.<n>Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models.<n>We propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space.
- Score: 32.08069972778743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our \href{https://github.com/kaiyuanCui/UltraBreak}{GitHub repository}.
Related papers
- LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback [31.15245650762331]
We propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity.<n>LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt.<n>Our evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms.
arXiv Detail & Related papers (2025-10-07T09:40:20Z) - Imperceptible Jailbreaking against Large Language Models [107.76039200173528]
We introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors.<n>By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen.<n>We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses.
arXiv Detail & Related papers (2025-10-06T17:03:50Z) - IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [70.43466586161345]
IDEATOR is a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks.<n>Our benchmark results on 11 recently releasedVLMs reveal significant gaps in safety alignment.<n>For instance, our challenge set ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Fuzz-Testing Meets LLM-Based Agents: An Automated and Efficient Framework for Jailbreaking Text-To-Image Generation Models [15.582860145268553]
JailFuzzer is a novel fuzzing framework driven by large language model (LLM) agents.<n>It generates natural and semantically coherent prompts, reducing the likelihood of detection by traditional defenses.<n>It achieves a high success rate in jailbreak attacks with minimal query overhead.
arXiv Detail & Related papers (2024-08-01T12:54:46Z) - Failures to Find Transferable Image Jailbreaks Between Vision-Language Models [20.385314634225978]
We focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs.<n>We find that transferable gradient-based image jailbreaks are extremely difficult to obtain.
arXiv Detail & Related papers (2024-07-21T16:27:24Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large
Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.