Related papers: Rethinking Robust Adversarial Concept Erasure in Diffusion Models

URL: http://arxiv.org/abs/2510.27285v2
Date: Sat, 08 Nov 2025 05:17:37 GMT
Title: Rethinking Robust Adversarial Concept Erasure in Diffusion Models
Authors: Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Xueming Li, Yue Zhan
Abstract summary: Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. Most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. We introduce S-GRACE, which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training.
Score: 13.578420382949828
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

Related papers

Erased or Dormant? Rethinking Concept Erasure Through Reversibility [6.895055915600732]
We evaluate two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion. We show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation. Our findings reveal critical limitations in existing concept erasure approaches.
arXiv Detail & Related papers (2025-05-22T03:26:46Z)
Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models [56.35484513848296]
FADE (Fine grained Attenuation for Diffusion Erasure) is an adjacency-aware unlearning algorithm for text-to-image generative models. It removes target concepts with minimal impact on correlated concepts, achieving a 12% improvement in retention performance over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-25T15:49:48Z)
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models [53.937498564603054]
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods are studied to help the model unlearn specific concepts. We propose TRCE, which uses a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.
arXiv Detail & Related papers (2025-03-10T14:37:53Z)
Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them [21.386640828092524]
Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models. We propose the Adaptive Guided Erasure (AGE) method, which dynamically selects optimal target concepts tailored to each undesirable concept. Results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance.
arXiv Detail & Related papers (2025-01-31T08:17:23Z)
Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts [31.232389877218377]
We introduce EraseBench, a comprehensive benchmark for evaluating post-erasure performance. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts.
arXiv Detail & Related papers (2025-01-16T20:42:17Z)
AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors [61.007590285263376]
Security concerns have driven researchers to unlearn inappropriate concepts through fine-tuning. Recent fine-tuning methods exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts. We propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue.
arXiv Detail & Related papers (2024-12-28T04:44:07Z)
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient [20.698305103879232]
We propose a novel concept domain correction framework named DoCo (Domain Correction). By aligning the output domains of sensitive and anchor concepts through adversarial training, our approach ensures comprehensive unlearning of target concepts. We also introduce a concept-preserving gradient surgery technique that mitigates conflicting gradient components, thereby preserving the model's utility while unlearning specific concepts.
arXiv Detail & Related papers (2024-05-24T07:47:36Z)
CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection [22.82243087156918]
Co-salient object detection (CoSOD) aims to identify the common and salient (usually in the foreground) regions across a given group of images. CoSOD methods can be easily affected by adversarial perturbations, leading to substantial accuracy reduction. We propose a novel robustness enhancement framework by first learning the concept of the co-salient objects based on the input group images.
arXiv Detail & Related papers (2024-03-27T13:33:14Z)
Separable Multi-Concept Erasure from Diffusion Models [52.51972530398691]
We propose a Separable Multi-concept Eraser (SepME) to eliminate unsafe concepts from large-scale diffusion models. SepME separates optimizable model weights, making each weight increment correspond to a specific concept erasure. Extensive experiments indicate the efficacy of our approach in eliminating concepts, preserving model performance, and offering flexibility in the erasure or recovery of various concepts.
arXiv Detail & Related papers (2024-02-03T11:10:57Z)
Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images. We present Geom-Erasing, a novel concept removal method based on geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.