Training-Free Safe Denoisers for Safe Use of Diffusion Models
- URL: http://arxiv.org/abs/2502.08011v2
- Date: Thu, 13 Feb 2025 02:01:58 GMT
- Title: Training-Free Safe Denoisers for Safe Use of Diffusion Models
- Authors: Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mi Jung Park
- Abstract summary: Powerful diffusion models (DMs) are often misused to generate not-safe-for-work (NSFW) content, copyrighted material, or data of individuals who wish to be forgotten. We develop a practical algorithm that produces high-quality samples while avoiding negation areas of the data distribution. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.
- Score: 49.045799120267915
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content or generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by heavily relying on text-based negative prompts or extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach, directly modifying the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints that need to be excluded) to avoid specific regions of the data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not safe, leading to our $\textit{safe}$ denoiser, which ensures that its final samples stay away from the region to be negated. Inspired by the derivation, we develop a practical algorithm that successfully produces high-quality samples while avoiding negation areas of the data distribution in text-conditional, class-conditional, and unconditional image generation scenarios. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.
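The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the general idea rather than the authors' derived update: estimate the clean sample from the noisy one as usual, then nudge that estimate away from the nearest member of a small negation set. The noise predictor `eps_model`, the cumulative noise schedule value `alpha_bar_t`, the tensor `negation_set`, and the repulsion strength `lam` are placeholder names, and the simple nearest-neighbour repulsion is an illustrative stand-in for the paper's safe denoiser, not its formula.

```python
import torch

@torch.no_grad()
def safe_denoised_estimate(x_t, t, eps_model, alpha_bar_t, negation_set, lam=0.1):
    """Illustrative 'safe denoiser' step: estimate the clean image x0 from x_t with a
    pretrained noise predictor, then nudge the estimate away from a set of images to
    be negated. The repulsion below is a stand-in for the paper's derived correction."""
    # Standard DDPM-style posterior-mean estimate of x0 given x_t.
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / (alpha_bar_t ** 0.5)

    # Nearest member of the negation set under flattened L2 distance.
    d = torch.cdist(x0_hat.flatten(1), negation_set.flatten(1))   # (B, N)
    nearest = negation_set[d.argmin(dim=1)]                       # (B, C, H, W)

    # Push the denoised estimate away from that nearest negated example.
    direction = x0_hat - nearest
    norm = direction.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8
    return x0_hat + lam * direction / norm
```

In a standard DDPM or DDIM sampler, this adjusted estimate of the clean image would simply replace the usual posterior-mean estimate when computing the next point on the trajectory, consistent with the abstract's claim that only the sampling trajectory is modified and no retraining or fine-tuning is needed.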
Related papers
- Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation [70.62656296780074]
We propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality.
arXiv Detail & Related papers (2025-05-27T21:34:40Z) - Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions [35.28819408507869]
Concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases. We propose a novel self-discovery approach that identifies a semantic direction vector in the embedding space and restricts text embeddings to a safe region (a minimal sketch of this kind of projection appears after this list). Our method can effectively reduce NSFW content and social bias generated by diffusion models compared to several state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-21T12:10:26Z) - Towards SFW sampling for diffusion models via external conditioning [1.0923877073891446]
This article explores the use of external sources for ensuring safe outputs in score-based generative models (SBMs). Our safe-for-work (SFW) sampler implements a Conditional Trajectory Correction step that guides samples away from undesired regions in the ambient space. Our experiments on the text-to-image SBM Stable Diffusion validate that the proposed SFW sampler effectively reduces the generation of explicit content.
arXiv Detail & Related papers (2025-05-12T17:27:40Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.
We propose finetuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization [22.225141381422873]
There is a growing concern about text-to-image diffusion models creating harmful content.
Post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed to mitigate these risks.
We propose the safe generation framework Detect-and-Guide (DAG) to perform self-diagnosis and fine-grained self-regulation.
DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on real-world prompts.
arXiv Detail & Related papers (2025-03-19T13:37:52Z) - Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity.
We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM).
UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter.
arXiv Detail & Related papers (2025-03-14T17:00:22Z) - CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models [13.799517170191919]
Recent research has shown that safety checkers are vulnerable to adversarial attacks, allowing attackers to generate Not Safe For Work (NSFW) images. We propose CROPS, a model-agnostic framework that easily defends against adversarial attacks that generate NSFW images, without requiring additional training.
arXiv Detail & Related papers (2025-01-09T16:43:21Z) - Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe, and controversial or culturally inappropriate outputs.
We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images.
We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
arXiv Detail & Related papers (2024-11-21T09:47:13Z) - Revealing the Unseen: Guiding Personalized Diffusion Models to Expose Training Data [10.619162675453806]
Diffusion Models (DMs) have evolved into advanced image generation tools.
FineXtract is a framework for extracting fine-tuning data.
Experiments on DMs fine-tuned on datasets such as WikiArt and DreamBooth, as well as on real-world checkpoints posted online, validate the effectiveness of our method.
arXiv Detail & Related papers (2024-10-03T23:06:11Z) - SIDE: Surrogate Conditional Data Extraction from Diffusion Models [32.18993348942877]
We present Surrogate condItional Data Extraction (SIDE), a framework that constructs data-driven surrogate conditions to enable targeted extraction from any DPM. We show that SIDE can successfully extract training data from so-called safe unconditional models, outperforming baseline attacks even on conditional models. Our work redefines the threat landscape for DPMs, establishing precise conditioning as a fundamental vulnerability and setting a new, stronger benchmark for model privacy evaluation.
arXiv Detail & Related papers (2024-10-03T13:17:06Z) - Score Forgetting Distillation: A Swift, Data-Free Method for Machine Unlearning in Diffusion Models [63.43422118066493]
Machine unlearning (MU) is a crucial foundation for developing safe, secure, and trustworthy GenAI models.
Traditional MU methods often rely on stringent assumptions and require access to real data.
This paper introduces Score Forgetting Distillation (SFD), an innovative MU approach that promotes the forgetting of undesirable information in diffusion models.
arXiv Detail & Related papers (2024-09-17T14:12:50Z) - Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models! [52.0855711767075]
EvoSeed is an evolutionary strategy-based algorithmic framework for generating photo-realistic natural adversarial samples.
We employ CMA-ES to optimize the search for an initial seed vector which, when processed by the conditional diffusion model, results in a natural adversarial sample that is misclassified by the target model.
Experiments show that the generated adversarial images are of high quality, raising concerns about harmful content that bypasses safety classifiers.
arXiv Detail & Related papers (2024-02-07T09:39:29Z) - To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now [22.75295925610285]
Diffusion models (DMs) have revolutionized the generation of realistic and complex images.
DMs also introduce potential safety hazards, such as producing harmful content and infringing data copyrights.
Despite the development of safety-driven unlearning techniques, doubts about their efficacy persist.
arXiv Detail & Related papers (2023-10-18T10:36:34Z) - SafeDiffuser: Safe Planning with Diffusion Probabilistic Models [97.80042457099718]
Diffusion model-based approaches have shown promise in data-driven planning, but there are no safety guarantees.
We propose a new method, called SafeDiffuser, to ensure diffusion probabilistic models satisfy specifications.
We test our method on a series of safe planning tasks, including maze path generation, legged robot locomotion, and 3D space manipulation.
arXiv Detail & Related papers (2023-05-31T19:38:12Z) - Diffusion-Based Adversarial Sample Generation for Improved Stealthiness and Controllability [62.105715985563656]
We propose a novel framework dubbed Diffusion-Based Projected Gradient Descent (Diff-PGD) for generating realistic adversarial samples.
Our framework can be easily customized for specific tasks such as digital attacks, physical-world attacks, and style-based attacks.
arXiv Detail & Related papers (2023-05-25T21:51:23Z)
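As a companion to the entry above on constraining text embeddings within safe regions (referenced in that summary), here is a minimal sketch of what clipping a prompt embedding along an unsafe semantic direction could look like. The function name, the way `unsafe_direction` is obtained, and the `max_component` threshold are assumptions made for illustration; the paper's self-discovery procedure for finding the direction vector and its exact definition of the safe region are not reproduced here.

```python
import torch

def constrain_text_embedding(text_emb, unsafe_direction, max_component=0.0):
    """Clip the component of a prompt embedding along a learned 'unsafe' semantic
    direction so that it does not exceed max_component. This is an illustrative
    stand-in for the paper's safe-region constraint, not its actual procedure."""
    d = unsafe_direction / unsafe_direction.norm()        # unit-length unsafe direction
    comp = (text_emb * d).sum(dim=-1, keepdim=True)       # signed projection onto d
    excess = torch.clamp(comp - max_component, min=0.0)   # amount exceeding the cap
    return text_emb - excess * d                          # project back onto the cap
```

The constrained embedding would then be passed to the text-conditional diffusion model in place of the raw prompt embedding.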