TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
- URL: http://arxiv.org/abs/2505.08804v1
- Date: Sun, 11 May 2025 06:32:33 GMT
- Title: TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
- Authors: Longtian Wang, Xiaofei Xie, Tianlin Li, Yuhan Zhi, Chao Shen
- Abstract summary: We introduce TokenProber, a method designed for sensitivity-aware differential testing. Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content. Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness.
- Score: 19.73325740171627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) models have significantly advanced in producing high-quality images. However, such models can also generate images containing not-safe-for-work (NSFW) content, such as pornography, violence, political content, and discrimination. To mitigate the risk of generating NSFW content, refusal mechanisms, i.e., safety checkers, have been developed to detect potential NSFW content. Adversarial prompting techniques have been developed to evaluate the robustness of these refusal mechanisms. The key challenge remains to subtly modify the prompt in a way that preserves its sensitive nature while bypassing the refusal mechanisms. In this paper, we introduce TokenProber, a method designed for sensitivity-aware differential testing, aimed at evaluating the robustness of the refusal mechanisms in T2I models by generating adversarial prompts. Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content. Thus, we conduct a fine-grained analysis of the impact of specific words within prompts, distinguishing between dirty words that are essential for NSFW content generation and discrepant words that highlight the different sensitivity assessments between T2I models and safety checkers. Through the sensitivity-aware mutation, TokenProber generates adversarial prompts, striking a balance between maintaining NSFW content generation and evading detection. Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness in bypassing safety filters compared to existing methods (e.g., 54%+ increase on average), highlighting TokenProber's ability to uncover robustness issues in the existing refusal mechanisms.
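The word-impact analysis and sensitivity-aware mutation described in the abstract can be illustrated with a minimal sketch. The oracles `t2i_nsfw_score` and `checker_score`, the impact threshold `tau`, and the synonym-based substitution below are illustrative assumptions, not the paper's actual scoring functions or mutation operators.

```python
import random

# Minimal sketch of sensitivity-aware differential testing in the spirit of
# TokenProber. The two oracles are placeholders: in practice they would query
# a T2I model (does the prompt still drive NSFW generation?) and a safety
# checker (does the prompt get flagged?).

def t2i_nsfw_score(prompt: str) -> float:
    """Placeholder: how strongly the T2I model renders NSFW content (0..1)."""
    raise NotImplementedError

def checker_score(prompt: str) -> float:
    """Placeholder: how sensitive the safety checker judges the prompt (0..1)."""
    raise NotImplementedError

def word_impacts(prompt: str):
    """Leave-one-out ablation: each word's impact on generation vs. the checker."""
    words = prompt.split()
    base_gen, base_chk = t2i_nsfw_score(prompt), checker_score(prompt)
    impacts = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        impacts.append((i, words[i],
                        base_gen - t2i_nsfw_score(ablated),   # drop in NSFW generation
                        base_chk - checker_score(ablated)))   # drop in checker sensitivity
    return impacts

def sensitivity_aware_mutation(prompt: str, synonyms: dict,
                               tau: float = 0.2, steps: int = 50):
    """Keep 'dirty' words (high generation impact); rewrite 'discrepant' words
    (high checker impact but low generation impact) until the checker is evaded."""
    words = prompt.split()
    for _ in range(steps):
        for i, w, d_gen, d_chk in word_impacts(" ".join(words)):
            if d_gen >= tau:
                continue                                    # dirty word: keep it
            if d_chk >= tau and w in synonyms:
                words[i] = random.choice(synonyms[w])       # discrepant word: substitute
        candidate = " ".join(words)
        if checker_score(candidate) < 0.5 and t2i_nsfw_score(candidate) > 0.5:
            return candidate        # evades the checker while still driving NSFW output
    return None
```

The differential aspect lies in comparing the two impact columns: a word the checker reacts to but the T2I model does not rely on is a candidate for replacement, while generation-critical words are left untouched.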
Related papers
- NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation [47.03824997129498]
"jailbreak" attacks in large language models bypass restrictions through subtle prompt modifications.<n>PromptSan is a novel approach to detoxify harmful prompts without altering model architecture.<n>PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics.
arXiv Detail & Related papers (2025-06-23T06:17:30Z)
- GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models [65.91565607573786]
Text-to-image (T2I) models can be misused to generate harmful content, including nudity or violence.
Recent research on red-teaming and adversarial attacks against T2I models has notable limitations.
We propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities.
arXiv Detail & Related papers (2025-06-11T09:09:12Z)
- OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models [73.6716695218951]
Over-refusal is a phenomenon that reduces the practical utility of T2I models.
We present OVERT (OVEr-Refusal evaluation on Text-to-image models), the first large-scale benchmark for assessing over-refusal behaviors.
arXiv Detail & Related papers (2025-05-27T15:42:46Z)
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
- SC-Pro: Training-Free Framework for Defending Unsafe Image Synthesis Attack [13.799517170191919]
Recent research has shown that safety checkers have vulnerabilities against adversarial attacks, allowing attackers to generate Not-Safe-For-Work (NSFW) images.
We propose SC-Pro, a training-free framework that easily defends against adversarial attacks that generate NSFW images.
arXiv Detail & Related papers (2025-01-09T16:43:21Z)
- AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models [39.11841245506388]
Malicious users often exploit text-to-image (T2I) models to generate Not-Safe-for-Work (NSFW) images.
We introduce AEIOU, a framework that is Adaptable, Efficient, Interpretable, Optimizable, and Unified against NSFW prompts in T2I models.
arXiv Detail & Related papers (2024-12-24T03:17:45Z)
- AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models [20.37481116837779]
AdvI2I is a novel framework that manipulates input images to induce diffusion models to generate NSFW content.
By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms.
We show that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards.
arXiv Detail & Related papers (2024-10-28T19:15:06Z)
- ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning [7.099258248662009]
There is a potential risk that text-to-image (T2I) models can generate unsafe images with uncomfortable content.
In our work, we focus on eliminating NSFW (not safe for work) content generation from T2I models.
We propose a customized reward function consisting of a CLIP (Contrastive Language-Image Pre-training) reward and a nudity reward to prune nudity content (a minimal sketch of such a combined reward appears below).
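The abstract only names the two reward terms, so the following is a hedged sketch of how a CLIP alignment reward and a nudity penalty might be combined; the function name, the NSFW detector input, and the weight `lambda_nsfw` are assumptions for illustration, not ShieldDiff's actual implementation.

```python
import torch

# Hypothetical combined reward for RL fine-tuning of a diffusion model:
# reward faithfulness to the prompt (CLIP cosine similarity) and penalize
# images that an external NSFW detector flags as nudity.
def combined_reward(image_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    nudity_prob: float,
                    lambda_nsfw: float = 1.0) -> torch.Tensor:
    # CLIP reward: cosine similarity between image and prompt embeddings,
    # keeping generations faithful to benign prompt content.
    clip_reward = torch.nn.functional.cosine_similarity(image_emb, text_emb, dim=-1)
    # Nudity reward: subtract the detector's nudity probability, scaled by a
    # weight, to steer the policy away from sexual content.
    return clip_reward - lambda_nsfw * nudity_prob
```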
arXiv Detail & Related papers (2024-10-04T19:37:56Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models together with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D), a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks, which were originally considered "safe", can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)