To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now
- URL: http://arxiv.org/abs/2310.11868v4
- Date: Sun, 7 Jul 2024 23:10:59 GMT
- Title: To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now
- Authors: Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu
- Abstract summary: Diffusion models (DMs) have revolutionized the generation of realistic and complex images.
DMs also introduce potential safety hazards, such as producing harmful content and infringing data copyrights.
Despite the development of safety-driven unlearning techniques, doubts about their efficacy persist.
- Score: 22.75295925610285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigate the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic classification abilities of DMs to simplify the creation of adversarial prompts, thereby eliminating the need for auxiliary classification or diffusion models. Through extensive benchmarking, we evaluate the robustness of widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable concepts, styles, or objects) across a variety of tasks. Our results demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the state-of-the-art adversarial prompt generation method and reveal the lack of robustness of current safety-driven unlearning techniques when applied to DMs. Codes are available at https://github.com/OPTML-Group/Diffusion-MU-Attack. WARNING: There exist AI generations that may be offensive in nature.
Related papers
- Score Forgetting Distillation: A Swift, Data-Free Method for Machine Unlearning in Diffusion Models [63.43422118066493]
Machine unlearning (MU) is a crucial foundation for developing safe, secure, and trustworthy GenAI models.
Traditional MU methods often rely on stringent assumptions and require access to real data.
This paper introduces Score Forgetting Distillation (SFD), an innovative MU approach that promotes the forgetting of undesirable information in diffusion models.
arXiv Detail & Related papers (2024-09-17T14:12:50Z)
- Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey [5.300811350105823]
Diffusion models (DMs) have achieved state-of-the-art performance on various generative tasks.
Recent studies have shown that DMs are prone to a wide range of attacks.
arXiv Detail & Related papers (2024-08-06T18:52:17Z)
- Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models [42.734578139757886]
Diffusion models (DMs) have achieved remarkable success in text-to-image generation, but they also pose safety risks.
The techniques of machine unlearning, also known as concept erasing, have been developed to address these risks.
This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning.
arXiv Detail & Related papers (2024-05-24T05:47:23Z)
- Robust Diffusion Models for Adversarial Purification [28.313494459818497]
Diffusion models (DMs) based adversarial purification (AP) has been shown to be the most powerful alternative to adversarial training (AT).
We propose a novel robust reverse process with adversarial guidance, which is independent of given pre-trained DMs.
This robust guidance not only ensures that the purified examples retain more semantic content but also mitigates the accuracy-robustness trade-off of DMs.
arXiv Detail & Related papers (2024-03-24T08:34:08Z)
- UnlearnCanvas: Stylized Image Dataset for Enhanced Machine Unlearning Evaluation in Diffusion Models [31.48739583108113]
Diffusion models (DMs) have demonstrated unprecedented capabilities in text-to-image generation and are widely used in diverse applications.
They have also raised significant societal concerns, such as the generation of harmful content and copyright disputes.
Machine unlearning (MU) has emerged as a promising solution, capable of removing undesired generative capabilities from DMs.
arXiv Detail & Related papers (2024-02-19T05:25:53Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- Understanding the Vulnerability of Skeleton-based Human Activity Recognition via Black-box Attack [53.032801921915436]
Human Activity Recognition (HAR) has been employed in a wide range of applications, e.g. self-driving cars.
Recently, the robustness of skeleton-based HAR methods has been questioned due to their vulnerability to adversarial attacks.
We show such threats exist, even when the attacker only has access to the input/output of the model.
We propose the very first black-box adversarial attack approach in skeleton-based HAR called BASAR.
arXiv Detail & Related papers (2022-11-21T09:51:28Z)
- Exploring Adversarially Robust Training for Unsupervised Domain Adaptation [71.94264837503135]
Unsupervised Domain Adaptation (UDA) methods aim to transfer knowledge from a labeled source domain to an unlabeled target domain.
This paper explores how to enhance the unlabeled data robustness via AT while learning domain-invariant features for UDA.
We propose a novel Adversarially Robust Training method for UDA accordingly, referred to as ARTUDA.
arXiv Detail & Related papers (2022-02-18T17:05:19Z)
- How Robust are Randomized Smoothing based Defenses to Data Poisoning? [66.80663779176979]
We present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality.
We propose a novel bilevel optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers.
Our attack is effective even when the victim trains the models from scratch using state-of-the-art robust training methods.
arXiv Detail & Related papers (2020-12-02T15:30:21Z)
- Stylized Adversarial Defense [105.88250594033053]
Adversarial training creates perturbation patterns and includes them in the training set to robustify the model.
We propose to exploit additional information from the feature space to craft stronger adversaries.
Our adversarial training approach demonstrates strong robustness compared to state-of-the-art defenses.
arXiv Detail & Related papers (2020-07-29T08:38:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.