Related papers: HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models

HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models

URL: http://arxiv.org/abs/2408.13896v3
Date: Sun, 15 Dec 2024 05:13:26 GMT
Title: HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models
Authors: Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Bai, Yang Liu, Qing Guo,
Abstract summary: Text-to-Image(T2I) models have achieved remarkable success in image generation and editing.<n>These models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work(NSFW) content.<n>We propose HTS-Attack, a token search attack method.
Score: 28.28898114141277
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-Image(T2I) models have achieved remarkable success in image generation and editing, yet these models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work(NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most of the previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods, those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While several black-box jailbreak attacks have been explored, they achieve the limited performance of jailbreaking T2I models due to difficulties associated with optimization in discrete spaces. To address this, we propose HTS-Attack, a heuristic token search attack method. HTS-Attack begins with an initialization that removes sensitive tokens, followed by a heuristic search where high-performing candidates are recombined and mutated. This process generates a new pool of candidates, and the optimal adversarial prompt is updated based on their effectiveness. By incorporating both optimal and suboptimal candidates, HTS-Attack avoids local optima and improves robustness in bypassing defenses. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.

Related papers

T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text to video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. We propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models [20.740929360321747]
Text-to-Image (T2I) generation poses risks related to the production of inappropriate or harmful content. We propose TCBS-Attack, a query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. Our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models.
arXiv Detail & Related papers (2025-04-15T11:53:40Z)
Unified Prompt Attack Against Text-to-Image Generation Models [30.24530622359188]
We propose UPAM, a framework to evaluate the robustness of T2I models from an attack perspective. UPAM unifies the attack on both textual and visual defenses. It also enables gradient-based optimization, overcoming reliance on enumeration for improved efficiency and effectiveness.
arXiv Detail & Related papers (2025-02-23T03:36:18Z)
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [104.94706600050557]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. We propose ICER, a novel red-teaming framework that generates interpretable and semantic meaningful problematic prompts. Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z)
A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
Multimodal Pragmatic Jailbreak on Text-to-image Models [43.67831238116829]
This work introduces a novel type of jailbreak, which triggers T2I models to generate the image with visual text. We benchmark nine representative T2I models, including two close-source commercial models. All tested models suffer from such type of jailbreak, with rates of unsafe generation ranging from 8% to 74%.
arXiv Detail & Related papers (2024-09-27T21:23:46Z)
Perception-guided Jailbreak against Text-to-Image Models [18.825079959947857]
We propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.
arXiv Detail & Related papers (2024-08-20T13:40:25Z)
DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization [20.958826487430194]
Red teaming attack methods are proposed to enhance or expose the T2I model's capability to generate unsuitable content. We propose DiffZOO which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works.
arXiv Detail & Related papers (2024-08-18T03:16:59Z)
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models [10.70975463369742]
We present the Jailbreaking Prompt Attack (JPA) JPA searches for the target malicious concepts in the text embedding space using a group of antonyms. A prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space.
arXiv Detail & Related papers (2024-04-02T09:49:35Z)
Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [150.57983348059528]
PRISM is an algorithm that automatically identifies human-interpretable and transferable prompts. It can effectively generate desired concepts given only black-box access to T2I models. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles and images.
arXiv Detail & Related papers (2024-03-28T02:35:53Z)
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts [16.317849859000074]
GuardT2I is a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts. Our experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator.
arXiv Detail & Related papers (2024-03-03T09:04:34Z)
Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency. We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
Cheating Suffix: Targeted Attack to Text-To-Image Diffusion Models with Multi-Modal Priors [59.43303903348258]
Diffusion models have been widely deployed in various image generation tasks. They face challenges of being maliciously exploited to generate harmful or sensitive images. We propose a targeted attack method named MMP-Attack.
arXiv Detail & Related papers (2024-02-02T12:39:49Z)
If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt. We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts. We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adrial attacks for discrete data (such as texts) are more challenging than continuous data (such as images) We propose textbfBERT-Attack, a high-quality and effective method to generate adversarial samples. Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
Towards Query-Efficient Black-Box Adversary with Zeroth-Order Natural Gradient Descent [92.4348499398224]
Black-box adversarial attack methods have received special attentions owing to their practicality and simplicity. We propose a zeroth-order natural gradient descent (ZO-NGD) method to design the adversarial attacks. ZO-NGD can obtain significantly lower model query complexities compared with state-of-the-art attack methods.
arXiv Detail & Related papers (2020-02-18T21:48:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.