UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
- URL: http://arxiv.org/abs/2405.03486v2
- Date: Thu, 5 Sep 2024 20:23:19 GMT
- Title: UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
- Authors: Yiting Qu, Xinyue Shen, Yixin Wu, Michael Backes, Savvas Zannettou, Yang Zhang,
- Abstract summary: We propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers.
First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe.
We then evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers powered by general-purpose visual language models.
- Score: 29.913089752247362
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the advent of text-to-image models and concerns about their misuse, developers are increasingly relying on image safety classifiers to moderate their generated unsafe images. Yet, the performance of current image safety classifiers remains unknown for both real-world and AI-generated images. In this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers, with a particular focus on the impact of AI-generated images on their performance. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe based on a set of 11 unsafe categories of images (sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers that are powered by general-purpose visual language models. Our assessment indicates that existing image safety classifiers are not comprehensive and effective enough to mitigate the multifaceted problem of unsafe images. Also, there exists a distribution shift between real-world and AI-generated images in image qualities, styles, and layouts, leading to degraded effectiveness and robustness. Motivated by these findings, we build a comprehensive image moderation tool called PerspectiveVision, which addresses the main drawbacks of existing classifiers with improved effectiveness and robustness, especially on AI-generated images. UnsafeBench and PerspectiveVision can aid the research community in better understanding the landscape of image safety classification in the era of generative AI.
Related papers
- Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [49.60774626839712]
Training multimodal generative models can expose users to harmful, unsafe and controversial or culturally-inappropriate outputs.
We propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process to generate safer images.
We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety.
arXiv Detail & Related papers (2024-11-21T09:47:13Z) - RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image Detection [60.960988614701414]
RIGID is a training-free and model-agnostic method for robust AI-generated image detection.
RIGID significantly outperforms existing trainingbased and training-free detectors.
arXiv Detail & Related papers (2024-05-30T14:49:54Z) - Counterfactual Image Generation for adversarially robust and
interpretable Classifiers [1.3859669037499769]
We propose a unified framework leveraging image-to-image translation Generative Adrial Networks (GANs) to produce counterfactual samples.
This is achieved by combining the classifier and discriminator into a single model that attributes real images to their respective classes and flags generated images as "fake"
We show how the model exhibits improved robustness to adversarial attacks, and we show how the discriminator's "fakeness" value serves as an uncertainty measure of the predictions.
arXiv Detail & Related papers (2023-10-01T18:50:29Z) - SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution [21.93748586123046]
We develop and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant NSFW images.
Our framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules.
Results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts.
arXiv Detail & Related papers (2023-09-25T13:20:15Z) - Distilling Adversarial Prompts from Safety Benchmarks: Report for the
Adversarial Nibbler Challenge [32.140659176912735]
Text-conditioned image generation models have recently achieved astonishing image quality and alignment results.
Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content.
As a contribution to the Adversarial Nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks.
Our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
arXiv Detail & Related papers (2023-09-20T18:25:44Z) - Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4 Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z) - Exploring the Robustness of Human Parsers Towards Common Corruptions [99.89886010550836]
We construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to assist us in evaluating the risk tolerance of human parsing models.
Inspired by the data augmentation strategy, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions.
arXiv Detail & Related papers (2023-09-02T13:32:14Z) - Adversarially-Aware Robust Object Detector [85.10894272034135]
We propose a Robust Detector (RobustDet) based on adversarially-aware convolution to disentangle gradients for model learning on clean and adversarial images.
Our model effectively disentangles gradients and significantly enhances the detection robustness with maintaining the detection ability on clean images.
arXiv Detail & Related papers (2022-07-13T13:59:59Z) - Deep Bayesian Image Set Classification: A Defence Approach against
Adversarial Attacks [32.48820298978333]
Deep neural networks (DNNs) are susceptible to be fooled with nearly high confidence by an adversary.
In practice, the vulnerability of deep learning systems against carefully perturbed images, known as adversarial examples, poses a dire security threat in the physical world applications.
We propose a robust deep Bayesian image set classification as a defence framework against a broad range of adversarial attacks.
arXiv Detail & Related papers (2021-08-23T14:52:44Z) - Deep Image Destruction: A Comprehensive Study on Vulnerability of Deep
Image-to-Image Models against Adversarial Attacks [104.8737334237993]
We present comprehensive investigations into the vulnerability of deep image-to-image models to adversarial attacks.
For five popular image-to-image tasks, 16 deep models are analyzed from various standpoints.
We show that unlike in image classification tasks, the performance degradation on image-to-image tasks can largely differ depending on various factors.
arXiv Detail & Related papers (2021-04-30T14:20:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.