Distilling Adversarial Prompts from Safety Benchmarks: Report for the
Adversarial Nibbler Challenge
- URL: http://arxiv.org/abs/2309.11575v1
- Date: Wed, 20 Sep 2023 18:25:44 GMT
- Title: Distilling Adversarial Prompts from Safety Benchmarks: Report for the
Adversarial Nibbler Challenge
- Authors: Manuel Brack, Patrick Schramowski, Kristian Kersting
- Abstract summary: Text-conditioned image generation models have recently achieved astonishing image quality and alignment results.
Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content.
As a contribution to the Adversarial Nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks.
Our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
- Score: 32.140659176912735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-conditioned image generation models have recently achieved astonishing
image quality and alignment results. Consequently, they are employed in a
fast-growing number of applications. Since they are highly data-driven, relying
on billion-sized datasets randomly scraped from the web, they also produce
unsafe content. As a contribution to the Adversarial Nibbler challenge, we
distill a large set of over 1,000 potential adversarial inputs from existing
safety benchmarks. Our analysis of the gathered prompts and corresponding
images demonstrates the fragility of input filters and provides further
insights into systematic safety issues in current generative image models.
Related papers
- When Image Generation Goes Wrong: A Safety Analysis of Stable Diffusion Models [0.0]
This study investigates the ability of ten popular Stable Diffusion models to generate harmful images.
We demonstrate that these models respond to harmful prompts by generating inappropriate content.
Our findings demonstrate a complete lack of any refusal behavior or safety measures in the models observed.
arXiv Detail & Related papers (2024-11-23T10:42:43Z)
- Semi-Truths: A Large-Scale Dataset of AI-Augmented Images for Evaluating Robustness of AI-Generated Image detectors [62.63467652611788]
We introduce SEMI-TRUTHS, featuring 27,600 real images, 223,400 masks, and 1,472,700 AI-augmented images.
Each augmented image is accompanied by metadata for standardized and targeted evaluation of detector robustness.
Our findings suggest that state-of-the-art detectors exhibit varying sensitivities to the types and degrees of perturbations, data distributions, and augmentation methods used.
arXiv Detail & Related papers (2024-11-12T01:17:27Z)
- UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images [29.913089752247362]
We propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers.
First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe.
We then evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers powered by general-purpose visual language models.
arXiv Detail & Related papers (2024-05-06T13:57:03Z)
- Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [19.06501699814924]
We build the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing implicitly adversarial prompts.
The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models.
We find that 14% of images that humans consider harmful are mislabeled as "safe" by machines.
arXiv Detail & Related papers (2024-02-14T22:21:12Z)
- SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution [21.93748586123046]
We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- Safe and Robust Watermark Injection with a Single OoD Image [90.71804273115585]
Training a high-performance deep neural network requires large amounts of data and computational resources.
We propose a safe and robust backdoor-based watermark injection technique.
We induce random perturbation of model parameters during watermark injection to defend against common watermark removal attacks.
arXiv Detail & Related papers (2023-09-04T19:58:35Z)
- Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models [6.475537049815622]
Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.
arXiv Detail & Related papers (2023-05-22T15:02:40Z)
- Surveillance Face Anti-spoofing [81.50018853811895]
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks.
We propose a Contrastive Quality-Invariance Learning (CQIL) network to alleviate the performance degradation caused by image quality.
A large number of experiments verify the quality of the SuHiFiMask dataset and the superiority of the proposed CQIL.
arXiv Detail & Related papers (2023-01-03T07:09:57Z)
- Robust Real-World Image Super-Resolution against Adversarial Attacks [115.04009271192211]
Adversarial image samples with quasi-imperceptible noise could threaten deep learning SR models.
We propose a robust deep learning framework for real-world SR that randomly erases potential adversarial noise.
Our proposed method is less sensitive to adversarial attacks and presents more stable SR results than existing models and defenses.
arXiv Detail & Related papers (2022-07-31T13:26:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.