BSPA: Exploring Black-box Stealthy Prompt Attacks against Image
Generators
- URL: http://arxiv.org/abs/2402.15218v1
- Date: Fri, 23 Feb 2024 09:28:16 GMT
- Title: BSPA: Exploring Black-box Stealthy Prompt Attacks against Image
Generators
- Authors: Yu Tian, Xiao Yang, Yinpeng Dong, Heming Yang, Hang Su, Jun Zhu
- Abstract summary: Large image generators offer significant transformative potential across diverse sectors.
Some studies reveal that image generators are notably susceptible to attacks and can generate Not Suitable For Work (NSFW) content.
We introduce a black-box stealthy prompt attack that adopts a retriever to simulate attacks from API users.
- Score: 43.23698370787517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extremely large image generators offer significant transformative potential
across diverse sectors. They allow users to design specific prompts to generate
realistic images through black-box APIs. However, some studies reveal that
image generators are notably susceptible to attacks and can be made to generate
Not Suitable For Work (NSFW) content by manually designed toxic texts, especially
ones imperceptible to human observers. A multitude of universal and transferable
prompts is urgently needed to improve the safety of image generators, especially
those released as black-box APIs. However, existing prompts are constrained by
labor-intensive design processes and rely heavily on the quality of the given
instructions. To address this, we introduce a black-box stealthy prompt attack (BSPA) that
adopts a retriever to simulate attacks from API users. It can effectively
harness filter scores to tune the retrieval space of sensitive words for
matching the input prompts, thereby crafting stealthy prompts tailored for
image generators. Significantly, this approach is model-agnostic and requires
no internal access to the model's features, ensuring its applicability to a
wide range of image generators. Building on BSPA, we have constructed an
automated prompt tool and a comprehensive prompt attack dataset (NSFWeval).
Extensive experiments demonstrate that BSPA effectively explores the security
vulnerabilities in a variety of state-of-the-art available black-box models,
including Stable Diffusion XL, Midjourney, and DALL-E 2/3. Furthermore, we
develop a resilient text filter and offer targeted recommendations to ensure
the security of image generators against prompt attacks in the future.
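The abstract does not say how the resilient text filter is built. As a purely illustrative sketch, and not the authors' method, the following shows one simple way a text filter could assign a score to a prompt: it compares character trigrams of each prompt word against a list of blocked terms, so that lightly obfuscated spellings still score high. The function names, the placeholder term "forbidden", and the 0.5 threshold are all assumptions introduced for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the lowercased text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 0)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_score(prompt: str, blocked_terms: list[str]) -> float:
    """Highest similarity between any prompt word and any blocked term."""
    return max(
        (cosine(char_ngrams(word), char_ngrams(term))
         for word in prompt.split()
         for term in blocked_terms),
        default=0.0,
    )


def is_blocked(prompt: str, blocked_terms: list[str], threshold: float = 0.5) -> bool:
    """Flag the prompt when its score reaches the (assumed) threshold."""
    return filter_score(prompt, blocked_terms) >= threshold


def flag_rate(prompts: list[str], blocked_terms: list[str], threshold: float = 0.5) -> float:
    """Fraction of prompts that the toy filter flags."""
    if not prompts:
        return 0.0
    flagged = sum(is_blocked(p, blocked_terms, threshold) for p in prompts)
    return flagged / len(prompts)


if __name__ == "__main__":
    terms = ["forbidden"]  # hypothetical placeholder for a real blocklist
    samples = ["a forbidden scene", "a f0rbidden scene", "a quiet lake"]
    for sample in samples:
        print(sample, round(filter_score(sample, terms), 3))
    print("flag rate:", flag_rate(samples, terms))
```

A lexical filter of this kind only catches surface-level variants of listed terms. Prompts that avoid sensitive words altogether, which is the setting the abstract describes as stealthy, would need a semantic or image-side check instead.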
Related papers
- Injecting Bias in Text-To-Image Models via Composite-Trigger Backdoors [16.495996266157274]
Large text-conditional image generative models can generate high-quality, realistic images from textual prompts.
In this paper, we demonstrate the possibility of bias injection threat by an adversary who backdoors such models with a small number of malicious data samples.
We present a novel framework that enables efficient generation of poisoning samples with composite (multi-word) triggers for such an attack.
arXiv Detail & Related papers (2024-06-21T14:53:19Z)
- ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users [18.3621509910395]
We propose a novel Automatic Red-Teaming framework, ART, to evaluate the safety risks of text-to-image models.
With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models.
We also introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models.
arXiv Detail & Related papers (2024-05-24T07:44:27Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Latent Guard is a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder.
Our proposed framework is composed of a data generation pipeline specific to the task.
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
- Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models! [52.0855711767075]
EvoSeed is an evolutionary strategy-based algorithmic framework for generating photo-realistic natural adversarial samples.
We employ CMA-ES to optimize the search for an initial seed vector, which, when processed by the Conditional Diffusion Model, results in the natural adversarial sample misclassified by the Model.
Experiments show that generated adversarial images are of high image quality, raising concerns about generating harmful content bypassing safety classifiers.
arXiv Detail & Related papers (2024-02-07T09:39:29Z)
- SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution [22.882337899780968]
We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models [54.19289900203071]
The rise in popularity of text-to-image generative artificial intelligence has attracted widespread public interest.
We demonstrate that this technology can be attacked to generate content that subtly manipulates its users.
We propose a Backdoor Attack on text-to-image Generative Models (BAGM).
Our attack is the first to target three popular text-to-image generative models across three stages of the generative process.
arXiv Detail & Related papers (2023-07-31T08:34:24Z)
- SneakyPrompt: Jailbreaking Text-to-image Generative Models [20.645304189835944]
We propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models.
Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter.
Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models.
arXiv Detail & Related papers (2023-05-20T03:41:45Z)
- Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder [57.739693628523]
We propose a framework for blind backdoor defense with Masked AutoEncoder (BDMAE).
BDMAE detects possible triggers in the token space using image structural similarity and label consistency between the test image and MAE restorations.
Our approach is blind to the model restorations, trigger patterns and image benignity.
arXiv Detail & Related papers (2023-03-27T19:23:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.