Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models
- URL: http://arxiv.org/abs/2504.11106v1
- Date: Tue, 15 Apr 2025 11:53:40 GMT
- Title: Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models
- Authors: Jiangtao Liu, Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin
- Abstract summary: Text-to-Image (T2I) generation poses risks related to the production of inappropriate or harmful content. We propose TCBS-Attack, a query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. Our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models.
- Score: 20.740929360321747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Text-to-Image (T2I) generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt checkers and post-hoc image checkers, are vulnerable to sophisticated adversarial attacks. In this work, we propose TCBS-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. By iteratively optimizing tokens near these boundaries, TCBS-Attack generates semantically coherent adversarial prompts capable of bypassing multiple defensive layers in T2I models. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services like DALL-E 3. TCBS-Attack achieves an ASR-4 of 45% and an ASR-1 of 21% on jailbreaking full-chain T2I models, significantly surpassing baseline methods.
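The abstract describes the attack only at a high level. As an illustration of the general idea it names (a query-based, token-level search that keeps the prompt semantically close to the target while pushing it toward the text and image checkers' decision boundaries), the following is a minimal Python sketch. Every interface here (`text_checker_prob`, `image_checker_prob`, `t2i_generate`, `semantic_similarity`, `vocab`) is a hypothetical stand-in, and the scoring and search rules are simplified guesses rather than the actual TCBS-Attack algorithm.

```python
import random

def checker_margin(prompt, text_checker_prob, image_checker_prob, t2i_generate):
    """Proxy for distance to the checkers' decision boundaries (threshold 0.5).
    Negative margin: the prompt or its image is flagged; positive margin:
    both checkers pass, with small values meaning 'just across the boundary'."""
    p_text = text_checker_prob(prompt)        # P(flagged) from the prompt checker
    if p_text >= 0.5:
        return 0.5 - p_text                   # blocked before generation
    image = t2i_generate(prompt)              # one black-box query to the T2I model
    p_image = image_checker_prob(image)       # P(flagged) from the post-hoc image checker
    return 0.5 - max(p_text, p_image)

def token_boundary_attack(target_prompt, vocab, text_checker_prob,
                          image_checker_prob, t2i_generate, semantic_similarity,
                          max_iters=100, n_candidates=8, sim_threshold=0.8):
    """Greedy token-substitution loop: try candidate tokens, keep substitutions
    that stay semantically close to the target while moving the prompt toward
    (and just across) the checkers' decision boundaries."""
    tokens = target_prompt.split()
    best_margin = checker_margin(target_prompt, text_checker_prob,
                                 image_checker_prob, t2i_generate)
    if best_margin > 0:
        return target_prompt                  # already bypasses both checkers
    for _ in range(max_iters):
        i = random.randrange(len(tokens))     # pick a token position to perturb
        for cand in random.sample(vocab, k=min(n_candidates, len(vocab))):
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            trial_prompt = " ".join(trial)
            if semantic_similarity(trial_prompt, target_prompt) < sim_threshold:
                continue                      # substitution drifts too far semantically
            margin = checker_margin(trial_prompt, text_checker_prob,
                                    image_checker_prob, t2i_generate)
            if margin > best_margin:          # closer to, or across, the boundary
                tokens, best_margin = trial, margin
                break
        if best_margin > 0:
            return " ".join(tokens)           # adversarial prompt that passes both checkers
    return None                               # query budget exhausted
```

Treating the checker outputs as continuous scores rather than binary accept/reject decisions is what makes "near the decision boundary" measurable in this sketch; the paper's own boundary-search criterion may differ.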
Related papers
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z) - Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning [34.73320827764541]
Text-to-Image (T2I) models typically deploy safety filters to prevent the generation of sensitive images.
Recent jailbreaking attack methods manually design prompts that guide an LLM to generate adversarial prompts.
We propose Reason2Attack (R2A), which aims to enhance the LLM's reasoning capabilities in generating adversarial prompts.
arXiv Detail & Related papers (2025-03-23T08:40:39Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step [62.82566977845765]
We introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process.
Our CoJ attack method can successfully bypass the safeguards of models in over 60% of cases.
We also propose an effective prompting-based method, Think Twice Prompting, that can successfully defend against over 95% of CoJ attacks.
arXiv Detail & Related papers (2024-10-04T19:04:43Z) - Multimodal Pragmatic Jailbreak on Text-to-image Models [43.67831238116829]
This work introduces a novel type of jailbreak, which triggers T2I models to generate images with visual text.
We benchmark nine representative T2I models, including two closed-source commercial models.
All tested models are susceptible to this type of jailbreak, with unsafe-generation rates ranging from 8% to 74%.
arXiv Detail & Related papers (2024-09-27T21:23:46Z) - HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models [28.28898114141277]
Text-to-Image (T2I) models have achieved remarkable success in image generation and editing. These models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. We propose HTS-Attack, a token search attack method.
arXiv Detail & Related papers (2024-08-25T17:33:40Z) - Automatic Jailbreaking of the Text-to-Image Generative AI Systems [76.9697122883554]
We study the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, with respect to copyright infringement under naive prompts.
We propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards.
Our framework successfully jailbreaks ChatGPT with an 11.0% block rate, making it generate copyrighted content 76% of the time.
arXiv Detail & Related papers (2024-05-26T13:32:24Z) - Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are based either on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, in which the presence of harmful concepts can be checked.
arXiv Detail & Related papers (2024-04-11T17:59:52Z) - GuardT2I: Defending Text-to-Image Models from Adversarial Prompts [16.317849859000074]
GuardT2I is a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts.
Our experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator.
arXiv Detail & Related papers (2024-03-03T09:04:34Z) - Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [19.06501699814924]
We build the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing implicitly adversarial prompts.
The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models.
We find that 14% of images that humans consider harmful are mislabeled as "safe" by machines.
arXiv Detail & Related papers (2024-02-14T22:21:12Z)