SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing
- URL: http://arxiv.org/abs/2506.09363v1
- Date: Wed, 11 Jun 2025 03:21:24 GMT
- Title: SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing
- Authors: Hongguang Zhu, Yunchao Wei, Mengyu Wang, Siyu Jiao, Yan Fang, Jiannan Huang, Yao Zhao,
- Abstract summary: Concept erasing finetunes weights to unlearn undesirable concepts.<n>Existing methods treat unsafe concept as a fixed word and repeatedly erase it.<n>We introduce semantic-augment erasing which transforms concept word erasure into concept domain erasure.
- Score: 65.82241040239452
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion models (DMs) have achieved significant progress in text-to-image generation. However, the inevitable inclusion of sensitive information during pre-training poses safety risks, such as unsafe content generation and copyright infringement. Concept erasing finetunes weights to unlearn undesirable concepts, and has emerged as a promising solution. However, existing methods treat unsafe concept as a fixed word and repeatedly erase it, trapping DMs in ``word concept abyss'', which prevents generalized concept-related erasing. To escape this abyss, we introduce semantic-augment erasing which transforms concept word erasure into concept domain erasure by the cyclic self-check and self-erasure. It efficiently explores and unlearns the boundary representation of concept domain through semantic spatial relationships between original and training DMs, without requiring additional preprocessed data. Meanwhile, to mitigate the retention degradation of irrelevant concepts while erasing unsafe concepts, we further propose the global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation, effectively expanding the retentive receptive field for irrelevant concepts. We name our method SAGE, and extensive experiments demonstrate the comprehensive superiority of SAGE compared with other methods in the safe generation of DMs. The code and weights will be open-sourced at https://github.com/KevinLight831/SAGE.
Related papers
- Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models [56.35484513848296]
FADE (Fine grained Attenuation for Diffusion Erasure) is an adjacency-aware unlearning algorithm for text-to-image generative models.<n>It removes target concepts with minimal impact on correlated concepts, achieving a 12% improvement in retention performance over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-25T15:49:48Z) - CRCE: Coreference-Retention Concept Erasure in Text-to-Image Diffusion Models [19.205261933636645]
We introduce CRCE, a novel concept erasure framework.<n>By explicitly modelling coreferential and retained concepts semantically, CRCE enables more precise concept removal.<n>Experiments demonstrate that CRCE outperforms existing methods on diverse erasure tasks.
arXiv Detail & Related papers (2025-03-18T13:09:01Z) - TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models [45.393001061726366]
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images.<n>To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts.<n>We propose TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.
arXiv Detail & Related papers (2025-03-10T14:37:53Z) - Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them [21.386640828092524]
Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models.<n>We propose the Adaptive Guided Erasure (AGE) method, which emphdynamically selects optimal target concepts tailored to each undesirable concept.<n>Results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance.
arXiv Detail & Related papers (2025-01-31T08:17:23Z) - Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe and controversial or culturally-inappropriate outputs.<n>We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images.<n>We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
arXiv Detail & Related papers (2024-11-21T09:47:13Z) - RealEra: Semantic-level Concept Erasure via Neighbor-Concept Mining [25.769144703607214]
Concept erasure has been proposed to remove the model's knowledge about protected and inappropriate concepts.
We propose RealEra to address this "concept residue" issue.
We show that RealEra outperforms previous concept erasing methods in terms of superior erasing efficacy, specificity, and generality.
arXiv Detail & Related papers (2024-10-11T17:55:30Z) - Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z) - MACE: Mass Concept Erasure in Diffusion Models [11.12833789743765]
We introduce MACE, a finetuning framework for the task of mass concept erasure.
This task aims to prevent models from generating images that embody unwanted concepts when prompted.
We conduct extensive evaluations of MACE against prior methods across four different tasks.
arXiv Detail & Related papers (2024-03-10T08:50:56Z) - Separable Multi-Concept Erasure from Diffusion Models [52.51972530398691]
We propose a Separable Multi-concept Eraser (SepME) to eliminate unsafe concepts from large-scale diffusion models.
The latter separates optimizable model weights, making each weight increment correspond to a specific concept erasure.
Extensive experiments indicate the efficacy of our approach in eliminating concepts, preserving model performance, and offering flexibility in the erasure or recovery of various concepts.
arXiv Detail & Related papers (2024-02-03T11:10:57Z) - Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present the Geom-Erasing, a novel concept removal method based on the geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.