AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
- URL: http://arxiv.org/abs/2510.24034v1
- Date: Tue, 28 Oct 2025 03:32:14 GMT
- Title: AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
- Authors: Yufan Liu, Wanqian Zhang, Huashan Chen, Lin Wang, Xiaojun Jia, Zheng Lin, Weiping Wang
- Abstract summary: AutoPrompT is a black-box framework that automatically generates human-readable adversarial suffixes for benign prompts. In this paper, we introduce a dual-evasion strategy in the optimization phase, enabling the bypass of both perplexity-based filters and blacklist word filters. Experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts.
- Score: 40.29708628615311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously elicit unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, rely on inefficient per-prompt optimization, and inevitably generate semantically meaningless prompts that are easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating pipeline that switches between adversarial suffix optimization and fine-tuning the LLM on the optimized suffixes. Furthermore, we integrate a dual-evasion strategy in the optimization phase, enabling the bypass of both perplexity-based filters and blacklist word filters: (1) we constrain the LLM to generate human-readable prompts through an auxiliary-LLM perplexity score, which starkly contrasts with prior token-level gibberish, and (2) we introduce banned-token penalties to suppress the explicit generation of blacklisted tokens. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability, which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai).
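The dual-evasion strategy can be read as a single scoring function: reward attack success while penalizing high perplexity (to keep the suffix human-readable) and explicit blacklist hits. The Python sketch below is a hypothetical illustration of that idea only; the GPT-2 perplexity proxy, the placeholder blacklist, the `attack_score` callable, and the penalty weights are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a dual-evasion score: attack objective minus
# readability (perplexity) and blacklist penalties. Illustrative only.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

BLACKLIST = {"nude", "gore"}  # placeholder banned words, not the paper's list


def perplexity(text: str) -> float:
    """Auxiliary-LM perplexity used as a human-readability proxy."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean next-token negative log-likelihood
    return math.exp(loss.item())


def banned_token_penalty(text: str) -> float:
    """Count explicit blacklist hits; optimization should push this to zero."""
    return float(sum(w.strip(".,!?") in BLACKLIST for w in text.lower().split()))


def dual_evasion_score(prompt: str, suffix: str, attack_score,
                       lambda_ppl: float = 0.01, lambda_ban: float = 1.0) -> float:
    """Higher is better: attack success minus readability and blacklist penalties."""
    full = f"{prompt} {suffix}"
    return (attack_score(full)
            - lambda_ppl * perplexity(full)
            - lambda_ban * banned_token_penalty(full))
```

In the alternating pipeline described above, a score of this kind would guide the suffix-optimization step, and the best-scoring suffixes would then serve as fine-tuning targets for the suffix-generating LLM.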
Related papers
- Learning to Inject: Automated Prompt Injection via Reinforcement Learning [11.25949760375263]
We propose a reinforcement learning framework that generates universal, transferable adversarial suffixes. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. We successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash.
arXiv Detail & Related papers (2026-02-05T15:14:46Z) - Diffusion LLMs are Natural Adversaries for any LLM [50.88535293540971]
We introduce a novel framework that transforms the resource-intensive (adversarial) prompt optimization problem into an efficient, amortized inference task. Our core insight is that pretrained, non-autoregressive generative LLMs can serve as powerful surrogates for prompt search. We find that the generated prompts are low-perplexity, diverse jailbreaks that exhibit strong transferability to a wide range of black-box target models.
arXiv Detail & Related papers (2025-10-31T19:04:09Z) - Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models [73.43013217318965]
We propose the Multimodal Prompt Decoupling Attack (MPDA), which uses the image modality to separate the harmful semantic components of the original unsafe prompt. A visual language model generates image captions to ensure semantic consistency between the generated NSFW images and the original unsafe prompts.
arXiv Detail & Related papers (2025-09-21T11:22:32Z) - GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization [19.44247617251449]
We introduce GhostPrompt, the first automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 12.5% to 99.0%. It generalizes to unseen filters including GPT-4.1 and successfully jailbreaks DALLE 3 to generate NSFW images.
arXiv Detail & Related papers (2025-05-25T05:13:06Z) - Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning [34.73320827764541]
Text-to-Image (T2I) models typically deploy safety filters to prevent the generation of sensitive images. Recent jailbreaking attack methods manually design prompts for the LLM to generate adversarial prompts. We propose Reason2Attack (R2A), which aims to enhance the LLM's reasoning capabilities in generating adversarial prompts.
arXiv Detail & Related papers (2025-03-23T08:40:39Z) - GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs [3.096869664709865]
We introduce Generative Adversarial Suffix Prompter (GASP), a novel framework that can efficiently generate human-readable jailbreak prompts. We show that GASP can produce natural adversarial prompts, significantly improving jailbreak success over baselines, reducing training times, and accelerating inference speed.
arXiv Detail & Related papers (2024-11-21T14:00:01Z) - AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs [51.217126257318924]
Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to the generation of inappropriate or harmful content. We present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds.
arXiv Detail & Related papers (2024-04-21T22:18:13Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate compared to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z) - Certifying LLM Safety against Adversarial Prompting [70.96868018621167]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt. We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees (a simplified sketch of the erase-and-check idea appears after this list).
arXiv Detail & Related papers (2023-09-06T04:37:20Z)
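On the defense side, the erase-and-check framework from the last entry certifies a prompt by running a safety filter not only on the full prompt but also on versions of it with trailing tokens erased; if any erased subsequence is flagged, the prompt is rejected, which yields a guarantee against adversarial suffixes up to a chosen length. The sketch below is a simplified, assumed illustration of that suffix-erasure idea, with `is_harmful` standing in for an arbitrary safety filter; it is not the authors' code.

```python
# Simplified sketch of erase-and-check (suffix-erasure variant).
# `is_harmful` stands in for any prompt safety filter; illustrative only.
from typing import Callable, List


def erase_and_check(prompt_tokens: List[str],
                    is_harmful: Callable[[List[str]], bool],
                    max_erase: int = 20) -> bool:
    """Return True (reject) if the full prompt or any suffix-erased version
    of it is flagged. Checking every erased subsequence is what provides a
    certificate against adversarial suffixes up to `max_erase` tokens long."""
    for i in range(min(max_erase, len(prompt_tokens)) + 1):
        candidate = prompt_tokens[:len(prompt_tokens) - i]
        if is_harmful(candidate):
            return True
    return False


# Example with a trivial keyword filter:
rejected = erase_and_check("describe an unsafe scene zx qv".split(),
                           lambda toks: "unsafe" in toks)
```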