TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2503.07389v1
- Date: Mon, 10 Mar 2025 14:37:53 GMT
- Title: TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
- Authors: Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu
- Abstract summary: Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts. We propose TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.
- Score: 45.393001061726366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: http://github.com/ddgoodgood/TRCE. CAUTION: This paper includes model-generated content that may contain offensive material.
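The first stage described in the abstract (mapping malicious prompt embeddings to safe ones through the cross-attention layers) can be illustrated with a small sketch. The abstract does not state the exact objective, so this assumes a closed-form, ridge-regularized edit of a single linear projection, in the style of earlier editing-based erasure methods. The function name `remap_projection`, the weight `lam`, and the random stand-in embeddings are all hypothetical and are not taken from the TRCE code.

```python
import numpy as np

def remap_projection(W, src, dst, lam=0.1):
    """Edit a linear projection so that W_new @ src_i is close to W @ dst_i.

    Assumed objective (not confirmed by the abstract):
        min_W'  sum_i ||W' s_i - W t_i||^2 + lam * ||W' - W||_F^2
    W:   (d_out, d_in) original projection (e.g. a key or value matrix)
    src: (n, d_in) embeddings to be remapped (e.g. unsafe-prompt [EoT] vectors)
    dst: (n, d_in) paired target embeddings (e.g. safe-prompt [EoT] vectors)
    """
    d_in = W.shape[1]
    S, T = src.T, dst.T                      # columns are embeddings
    A = W @ (T @ S.T) + lam * W              # (d_out, d_in)
    B = S @ S.T + lam * np.eye(d_in)         # (d_in, d_in), symmetric positive definite
    # Closed-form minimizer W' = A @ inv(B); solve avoids an explicit inverse.
    return np.linalg.solve(B, A.T).T

# Toy demonstration with random stand-ins for real text-encoder embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
src = rng.normal(size=(4, 16))
dst = rng.normal(size=(4, 16))

W_new = remap_projection(W, src, dst)
before = np.linalg.norm(W @ src.T - W @ dst.T)      # mapping error before the edit
after = np.linalg.norm(W_new @ src.T - W @ dst.T)   # mapping error after the edit
print(f"mapping error before: {before:.3f}, after: {after:.3f}")
```

The regularizer `lam` controls the trade-off the abstract refers to: a small value pulls the source embeddings strongly toward their targets, while a large value keeps the edited projection close to the original and so preserves unrelated behavior. The second, contrastive stage is not covered by this sketch.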
Related papers
- On the Vulnerability of Concept Erasure in Diffusion Models [13.916443687966039]
Research on machine unlearning has developed various concept erasure methods, which aim to remove the effect of unwanted data through post-hoc training. We show these erasure techniques are vulnerable, where images of supposedly erased concepts can still be generated using adversarially crafted prompts. We introduce RECORD, a coordinate-descent-based algorithm that discovers prompts capable of eliciting the generation of erased content.
arXiv Detail & Related papers (2025-02-24T17:26:01Z)
- Continuous Concepts Removal in Text-to-image Diffusion Models [27.262721132177845]
Concerns have been raised about the potential for text-to-image models to create content that infringes on copyrights or depicts disturbing subject matter.
We propose a novel approach called CCRT that includes a designed knowledge distillation paradigm.
It constrains the text-image alignment behavior during the continuous concept removal process by using a set of text prompts.
arXiv Detail & Related papers (2024-11-30T20:40:10Z)
- Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe and controversial or culturally-inappropriate outputs. We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
- Growth Inhibitors for Suppressing Inappropriate Image Concepts in Diffusion Models [35.2881940850787]
Text-to-image diffusion models inadvertently learn inappropriate concepts from vast and unfiltered training data. Our method effectively captures the manifestation of subtle words at the image level, enabling direct and efficient erasure of target concepts.
arXiv Detail & Related papers (2024-08-02T05:17:14Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
- Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
- Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present Geom-Erasing, a novel concept removal method based on geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z)
- Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models [63.20512617502273]
We propose a method called SDD to prevent problematic content generation in text-to-image diffusion models.
Our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality.
arXiv Detail & Related papers (2023-07-12T07:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.