Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
- URL: http://arxiv.org/abs/2505.21556v1
- Date: Mon, 26 May 2025 17:27:32 GMT
- Title: Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
- Authors: Hee-Seon Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim
- Abstract summary: We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks.
- Score: 16.04435108299333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model's safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.
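As a rough illustration of the optimization-based objective described in the abstract (this is a toy stand-in, not the paper's model or code), the B2T idea can be sketched with a tiny linear "LVLM": the text conditioning is benign and fixed, and only the image is optimized, under an L_inf budget, to raise the probability of a chosen target token. All shapes, names, and values below are illustrative assumptions.

```python
import numpy as np

# Toy stand-in "LVLM": next-token logits are a linear function of a small
# image vector plus a fixed bias from the benign text prompt. We optimize
# the image alone to raise the probability of a hypothetical target token,
# so the image, not the text, must do the misaligning (the B2T setting).
rng = np.random.default_rng(0)
VOCAB, PIXELS = 8, 16
W = rng.normal(size=(VOCAB, PIXELS))   # image -> next-token logits
text_bias = rng.normal(size=VOCAB)     # contribution of the benign prompt
target = 3                             # index of the hypothetical target token

def log_prob_target(image):
    z = W @ image + text_bias
    z = z - z.max()                    # numerical stability
    return z[target] - np.log(np.exp(z).sum())

def grad_log_prob(image):
    z = W @ image + text_bias
    p = np.exp(z - z.max())
    p /= p.sum()
    return W[target] - p @ W           # d log p(target) / d image

# PGD-style sign-gradient ascent on the image within an L_inf ball of eps,
# mirroring the usual optimization-based jailbreak recipe.
image = np.zeros(PIXELS)
eps, alpha = 0.5, 0.05
before = log_prob_target(image)
for _ in range(200):
    image = np.clip(image + alpha * np.sign(grad_log_prob(image)), -eps, eps)
after = log_prob_target(image)
print(f"log p(target): {before:.3f} -> {after:.3f}")
```

In a real attack the linear map would be a frozen vision-language model and the gradient would come from backpropagation through it; the projection step and the sign-gradient update carry over unchanged.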
Related papers
- Activation-Guided Local Editing for Jailbreaking Attacks [33.13949817155855]
Token-level jailbreak attacks often produce incoherent or unreadable inputs. Prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity. We propose a concise and effective two-stage framework that combines the advantages of these approaches.
arXiv Detail & Related papers (2025-08-01T11:52:24Z)
- Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition. We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts. We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
- Red Team Diffuser: Exposing Toxic Continuation Vulnerabilities in Vision-Language Models via Reinforcement Learning [27.68654681867373]
We propose a red teaming diffusion model that coordinates adversarial image generation and toxic continuation through reinforcement learning. Our key innovations include dynamic cross-modal attack and stealth-aware optimization. Experimental results demonstrate the effectiveness of RTD, increasing the toxicity rate of LLaVA outputs by 10.69% over text-only baselines.
arXiv Detail & Related papers (2025-03-08T13:51:40Z)
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL). We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs [9.312913540732445]
Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks. Jailbreaks have been exploited by cybercriminals and blackhat actors to cause significant harm. We introduce a novel safeguard, called SafeNudge, that combines Controlled Text Generation with "nudging".
arXiv Detail & Related papers (2025-01-02T15:15:38Z)
- DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak [51.8218217407928]
Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models.
arXiv Detail & Related papers (2024-12-23T12:44:54Z)
- PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization [8.819101213981053]
We propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM. We enhance these features through bidirectional cross-modal interaction optimization. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods.
arXiv Detail & Related papers (2024-12-08T11:14:16Z)
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
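The joint image-prefix/text-suffix recipe in the last entry above can be sketched as a toy alternating optimization. Everything below is an illustrative stand-in under simplifying assumptions (a linear surrogate in place of an LVLM, greedy per-slot token choice in place of a real discrete search), not the paper's actual method.

```python
import numpy as np

# Toy sketch of bi-modal co-optimization: the "affirmative" target logit is
# modeled as a linear function of a continuous adversarial image prefix plus
# per-slot contributions from a discrete adversarial text suffix.
rng = np.random.default_rng(1)
VOCAB, PIXELS, SUFFIX_LEN = 10, 12, 4
W_img = rng.normal(size=(VOCAB, PIXELS))             # image -> logits
W_pos = rng.normal(size=(SUFFIX_LEN, VOCAB, VOCAB))  # per-slot token -> logits
target = 7                                           # hypothetical target token

def target_logit(image, suffix):
    z = W_img @ image
    for i, tok in enumerate(suffix):
        z = z + W_pos[i][:, tok]
    return z[target]

image = np.zeros(PIXELS)
suffix = [0] * SUFFIX_LEN
base = target_logit(image, suffix)

# (a) Image step: the target logit is linear in the image here, so the best
#     L_inf-bounded image is eps * sign of the target row of W_img.
eps = 1.0
image = eps * np.sign(W_img[target])

# (b) Suffix step: greedily pick, per slot, the token that contributes most
#     to the target logit (a crude stand-in for gradient-guided token search).
for i in range(SUFFIX_LEN):
    suffix[i] = int(np.argmax(W_pos[i][target]))

final = target_logit(image, suffix)
print(f"target logit: {base:.3f} -> {final:.3f}")
```

With a real model the two steps would alternate over many rounds, with the image update driven by backpropagated gradients and the suffix update by a discrete search over candidate token swaps; the coordination between the continuous and discrete variables is the point this sketch isolates.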
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.