Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models
- URL: http://arxiv.org/abs/2509.21360v1
- Date: Sun, 21 Sep 2025 11:22:32 GMT
- Title: Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models
- Authors: Xingkai Peng, Jun Jiang, Meng Tong, Shuai Li, Weiming Zhang, Nenghai Yu, Kejiang Chen
- Abstract summary: The Multimodal Prompt Decoupling Attack (MPDA) uses the image modality to separate the harmful semantic components of the original unsafe prompt. A vision-language model generates image captions to ensure semantic consistency between the generated NSFW images and the original unsafe prompts.
- Score: 73.43013217318965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) models have been widely applied in generating high-fidelity images across various domains. However, these models may also be abused to produce Not-Safe-for-Work (NSFW) content via jailbreak attacks. Existing jailbreak methods primarily manipulate the textual prompt, leaving potential vulnerabilities in image-based inputs largely unexplored. Moreover, text-based methods face challenges in bypassing the model's safety filters. In response to these limitations, we propose the Multimodal Prompt Decoupling Attack (MPDA), which utilizes the image modality to separate the harmful semantic components of the original unsafe prompt. MPDA follows three core steps: first, a large language model (LLM) decouples unsafe prompts into pseudo-safe prompts and harmful prompts. The former are seemingly harmless sub-prompts that can bypass filters, while the latter are sub-prompts with unsafe semantics that trigger filters. Subsequently, the LLM rewrites the harmful prompts into natural adversarial prompts that bypass safety filters and guide the T2I model to modify the base image into an NSFW output. Finally, to ensure semantic consistency between the generated NSFW images and the original unsafe prompts, a vision-language model generates image captions, providing a new pathway for the LLM to iteratively rewrite and refine the generated content.
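The three MPDA steps described in the abstract can be sketched as a toy loop. Every function below (decouple, rewrite, edit_image, caption) is a fabricated string-based stub standing in for a real LLM, T2I, or VLM call; none of this is the authors' implementation.

```python
# Hedged sketch of the MPDA pipeline; all model calls are toy stubs.

def decouple(prompt):
    # Step 1 (LLM): split the unsafe prompt into a pseudo-safe sub-prompt
    # that bypasses filters and a harmful sub-prompt that would trigger them.
    words = prompt.split()
    return words[0], " ".join(words[1:])

def rewrite(harmful, feedback=""):
    # Step 2 (LLM): paraphrase the harmful sub-prompt into a natural-looking
    # adversarial prompt; caption feedback steers later rounds.
    return f"subtly depicting {harmful} {feedback}".strip()

def edit_image(base, adversarial):
    # T2I image-to-image edit guided by the adversarial prompt.
    return f"{base} + edit[{adversarial}]"

def caption(image):
    # Step 3 (VLM): describe the generated image so the LLM can check
    # semantic consistency with the original unsafe prompt.
    return f"(looks like: {image})"

def mpda(prompt, rounds=3):
    pseudo_safe, harmful = decouple(prompt)
    image = f"base({pseudo_safe})"  # base image from the filter-passing part
    adv = rewrite(harmful)
    for _ in range(rounds):
        image = edit_image(image, adv)
        adv = rewrite(harmful, feedback=caption(image))  # iterative refinement
    return image

print(mpda("violent battle scene"))
```

The point of the sketch is the control flow: the pseudo-safe part seeds a base image, and the caption-feedback loop keeps pulling the edits back toward the original semantics.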
Related papers
- SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content.
We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality.
SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
arXiv Detail & Related papers (2025-10-05T10:24:48Z)
- Iterative Prompt Refinement for Safer Text-to-Image Generation [4.174845397893041]
Existing safety methods typically refine prompts using large language models (LLMs).
We propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images.
Our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content.
arXiv Detail & Related papers (2025-09-17T07:16:06Z)
- NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation [47.03824997129498]
"Jailbreak" attacks in large language models bypass restrictions through subtle prompt modifications.
PromptSan is a novel approach to detoxify harmful prompts without altering model architecture.
PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics.
arXiv Detail & Related papers (2025-06-23T06:17:30Z)
- GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models [65.91565607573786]
Text-to-image (T2I) models can be misused to generate harmful content, including nudity or violence.
Recent research on red-teaming and adversarial attacks against T2I models has notable limitations.
We propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities.
arXiv Detail & Related papers (2025-06-11T09:09:12Z)
- GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization [19.44247617251449]
We introduce GhostPrompt, the first automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback.
It achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 12.5% to 99.0%.
It generalizes to unseen filters, including GPT-4.1, and successfully jailbreaks DALLE 3 to generate NSFW images.
arXiv Detail & Related papers (2025-05-25T05:13:06Z)
- SafeText: Safe Text-to-image Models via Aligning the Text Encoder [38.14026164194725]
Text-to-image models can generate harmful images when presented with unsafe prompts.
We propose SafeText, a novel alignment method that fine-tunes the text encoder rather than the diffusion module.
Our results show that SafeText effectively prevents harmful image generation with minor impact on the images for safe prompts.
arXiv Detail & Related papers (2025-02-28T01:02:57Z)
- PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models [38.45239843869313]
Text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions.
T2I models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content.
We present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models.
arXiv Detail & Related papers (2025-01-07T05:39:21Z)
- Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models [80.77246856082742]
The Safety Snowball Agent (SSA) is a novel agent-based framework leveraging agents' autonomous and tool-using abilities to jailbreak LVLMs.
Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreak success rates against the latest LVLMs.
arXiv Detail & Related papers (2024-11-18T11:58:07Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks, originally considered "safe", can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- Certifying LLM Safety against Adversarial Prompting [70.96868018621167]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt.
We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
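The erase-and-check idea can be illustrated with a toy suffix-mode sketch. The keyword blocklist and the "jb" fooling token below are fabricated stand-ins for the paper's LLM-based safety filter and for real adversarial suffixes.

```python
# Toy sketch of erase-and-check (suffix mode); filter and tokens are invented.

BLOCKLIST = {"bomb"}

def toy_filter(tokens):
    # Stand-in safety filter that a crafted suffix token "jb" can fool,
    # mimicking how adversarial suffixes defeat learned classifiers.
    if tokens and tokens[-1] == "jb":
        return False  # fooled: the adversarial token suppresses detection
    return any(t in BLOCKLIST for t in tokens)

def erase_and_check_suffix(tokens, max_erase, is_harmful=toy_filter):
    # Run the filter on the prompt with 0..max_erase trailing tokens erased
    # and flag harmful if ANY check fires. Any adversarial suffix of length
    # <= max_erase is fully erased in one of these checks, which is the
    # source of the certified guarantee the abstract claims.
    for d in range(max_erase + 1):
        kept = list(tokens[: len(tokens) - d]) if d else list(tokens)
        if is_harmful(kept):
            return True
    return False

attacked = "how to build a bomb".split() + ["jb"]
print(toy_filter(attacked))                 # False: the filter alone is fooled
print(erase_and_check_suffix(attacked, 1))  # True: suffix erased, then caught
```

The design choice to flag on *any* check trades false positives on long benign prompts for a provable bound on suffix attacks.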
arXiv Detail & Related papers (2023-09-06T04:37:20Z)
- SneakyPrompt: Jailbreaking Text-to-image Generative Models [20.645304189835944]
We propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models.
Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter.
Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models.
arXiv Detail & Related papers (2023-05-20T03:41:45Z)
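The query-and-perturb loop described in the SneakyPrompt abstract can be caricatured as follows. The real attack chooses token substitutions via reinforcement learning over the model's query results; this sketch substitutes randomly, and the filter and candidate tokens are invented for illustration.

```python
# Toy sketch of a query-based filter-bypass search; all names are fabricated.
import random

SENSITIVE = {"nude"}
SUBSTITUTES = ["glowy", "nvde", "artful"]  # hypothetical replacement tokens

def filter_blocks(tokens):
    # Stand-in for the target model's safety filter.
    return any(t in SENSITIVE for t in tokens)

def sneaky_search(prompt, max_queries=50, seed=0):
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(max_queries):
        if not filter_blocks(tokens):
            return " ".join(tokens)  # filter bypassed: attack prompt found
        # Perturb a flagged token; the paper picks substitutions strategically
        # from query feedback rather than at random.
        i = next(i for i, t in enumerate(tokens) if t in SENSITIVE)
        tokens[i] = rng.choice(SUBSTITUTES)
    return None  # query budget exhausted

print(sneaky_search("a nude figure painting"))
```

Even this naive version shows the attack's black-box structure: only the filter's accept/block signal is needed, never the model's internals.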
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.