Distilling Adversarial Prompts from Safety Benchmarks: Report for the
Adversarial Nibbler Challenge
- URL: http://arxiv.org/abs/2309.11575v1
- Date: Wed, 20 Sep 2023 18:25:44 GMT
- Title: Distilling Adversarial Prompts from Safety Benchmarks: Report for the
Adversarial Nibbler Challenge
- Authors: Manuel Brack, Patrick Schramowski, Kristian Kersting
- Abstract summary: Text-conditioned image generation models have recently achieved astonishing image quality and alignment results.
Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content.
As a contribution to the Adversarial Nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks.
Our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
- Score: 32.140659176912735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-conditioned image generation models have recently achieved astonishing
image quality and alignment results. Consequently, they are employed in a
fast-growing number of applications. Since they are highly data-driven, relying
on billion-sized datasets randomly scraped from the web, they also produce
unsafe content. As a contribution to the Adversarial Nibbler challenge, we
distill a large set of over 1,000 potential adversarial inputs from existing
safety benchmarks. Our analysis of the gathered prompts and corresponding
images demonstrates the fragility of input filters and provides further
insights into systematic safety issues in current generative image models.
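As a rough illustration of what "distilling" adversarial inputs from an existing safety benchmark can look like in practice, the sketch below keeps benchmark prompts that slip past a naive keyword input filter yet are annotated as frequently producing unsafe images. This is a minimal sketch, not the authors' pipeline: the file name, the column names ("prompt", "unsafe_rate"), the threshold, and the blocklist are illustrative assumptions.

# Minimal sketch (assumed column names and blocklist, not the authors' code):
# distill candidate adversarial prompts from a T2I safety benchmark by keeping
# prompts that look benign to a simple input filter but carry a high annotated
# rate of unsafe generations.
import csv
import re

BLOCKLIST = {"blood", "gore", "nude", "naked", "corpse", "weapon"}  # toy input filter

def passes_input_filter(prompt: str) -> bool:
    """Return True if a naive keyword filter would let this prompt through."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    return tokens.isdisjoint(BLOCKLIST)

def distill_adversarial_prompts(benchmark_csv: str, min_unsafe_rate: float = 0.5):
    """Collect prompts that pass the input filter yet are annotated as
    frequently yielding unsafe images in the benchmark."""
    candidates = []
    with open(benchmark_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prompt = row["prompt"]                   # assumed column
            unsafe_rate = float(row["unsafe_rate"])  # assumed column
            if unsafe_rate >= min_unsafe_rate and passes_input_filter(prompt):
                candidates.append(prompt)
    return candidates

if __name__ == "__main__":
    prompts = distill_adversarial_prompts("safety_benchmark.csv")
    print(f"Distilled {len(prompts)} candidate adversarial prompts")

Prompts selected this way are exactly the ones that expose the fragility of input filters: they read as harmless text but are known to elicit unsafe images.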
Related papers
- UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images [29.913089752247362]
We propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers.
First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe.
We then evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers powered by general-purpose visual language models.
arXiv Detail & Related papers (2024-05-06T13:57:03Z)
- Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [19.06501699814924]
We build the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing implicitly adversarial prompts.
The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models.
We find that 14% of images that humans consider harmful are mislabeled as "safe" by machines.
arXiv Detail & Related papers (2024-02-14T22:21:12Z)
- SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution [21.93748586123046]
We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our results show that around half of the prompts in existing safe-prompting benchmarks, originally considered "safe", can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- Safe and Robust Watermark Injection with a Single OoD Image [90.71804273115585]
Training a high-performance deep neural network requires large amounts of data and computational resources.
We propose a safe and robust backdoor-based watermark injection technique.
We induce random perturbation of model parameters during watermark injection to defend against common watermark removal attacks.
arXiv Detail & Related papers (2023-09-04T19:58:35Z)
- DeepfakeArt Challenge: A Benchmark Dataset for Generative AI Art Forgery and Data Poisoning Detection [57.51313366337142]
There has been growing concern over the use of generative AI for malicious purposes.
In the realm of visual content synthesis using generative AI, key areas of significant concern have been image forgery and data poisoning.
We introduce the DeepfakeArt Challenge, a large-scale challenge benchmark dataset designed specifically to aid in the building of machine learning algorithms for generative AI art forgery and data poisoning detection.
arXiv Detail & Related papers (2023-06-02T05:11:27Z)
- Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models [6.475537049815622]
Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.
arXiv Detail & Related papers (2023-05-22T15:02:40Z)
- Surveillance Face Anti-spoofing [81.50018853811895]
Face Anti-spoofing (FAS) is essential to secure face recognition systems from various physical attacks.
We propose a Contrastive Quality-Invariance Learning (CQIL) network to alleviate the performance degradation caused by variations in image quality.
A large number of experiments verify the quality of the SuHiFiMask dataset and the superiority of the proposed CQIL.
arXiv Detail & Related papers (2023-01-03T07:09:57Z)
- Robust Real-World Image Super-Resolution against Adversarial Attacks [115.04009271192211]
Adversarial image samples with quasi-imperceptible noise can threaten deep learning SR models.
We propose a robust deep learning framework for real-world SR that randomly erases potential adversarial noises.
Our proposed method is less sensitive to adversarial attacks and produces more stable SR results than existing models and defenses.
arXiv Detail & Related papers (2022-07-31T13:26:33Z)