Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression
- URL: http://arxiv.org/abs/2505.19398v2
- Date: Thu, 29 May 2025 14:48:02 GMT
- Title: Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression
- Authors: Yiwei Xie, Ping Liu, Zheng Zhang
- Abstract summary: Uncontrolled reproduction of sensitive, copyrighted, or harmful imagery poses serious ethical, legal, and safety challenges. The concept erasure paradigm has emerged as a promising direction, enabling the selective removal of specific semantic concepts from generative models. This survey aims to guide researchers toward safer, more ethically aligned generative models.
- Score: 10.950528923845955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Image (T2I) models have demonstrated impressive capabilities in generating high-quality and diverse visual content from natural language prompts. However, uncontrolled reproduction of sensitive, copyrighted, or harmful imagery poses serious ethical, legal, and safety challenges. To address these concerns, the concept erasure paradigm has emerged as a promising direction, enabling the selective removal of specific semantic concepts from generative models while preserving their overall utility. This survey provides a comprehensive overview and in-depth synthesis of concept erasure techniques in T2I diffusion models. We systematically categorize existing approaches along three key dimensions: intervention level, which identifies specific model components targeted for concept removal; optimization structure, referring to the algorithmic strategies employed to achieve suppression; and semantic scope, concerning the complexity and nature of the concepts addressed. This multi-dimensional taxonomy enables clear, structured comparisons across diverse methodologies, highlighting fundamental trade-offs between erasure specificity, generalization, and computational complexity. We further discuss current evaluation benchmarks, standardized metrics, and practical datasets, emphasizing gaps that limit comprehensive assessment, particularly regarding robustness and practical effectiveness. Finally, we outline major challenges and promising future directions, including disentanglement of concept representations, adaptive and incremental erasure strategies, adversarial robustness, and new generative architectures. This survey aims to guide researchers toward safer, more ethically aligned generative models, providing foundational knowledge and actionable recommendations to advance responsible development in generative AI.
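To make the survey's three-axis taxonomy concrete, here is a minimal Python sketch encoding it as a data structure. All category labels, the `ErasureMethod` class, and the example placement of SepME are illustrative assumptions, not the survey's exact terminology.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Illustrative placeholder labels for the survey's three axes;
# the paper's own category names may differ.
class InterventionLevel(Enum):
    TEXT_ENCODER = auto()        # edit prompt/text embeddings
    CROSS_ATTENTION = auto()     # edit attention key/value projections
    UNET_WEIGHTS = auto()        # fine-tune the denoiser itself

class OptimizationStructure(Enum):
    CLOSED_FORM_EDIT = auto()    # direct, training-free weight modification
    FINE_TUNING = auto()         # gradient-based suppression objectives
    INFERENCE_GUIDANCE = auto()  # steer the sampling process at test time

class SemanticScope(Enum):
    SINGLE_CONCEPT = auto()      # e.g. one artist's style
    MULTI_CONCEPT = auto()       # several concepts erased jointly
    ABSTRACT_CONCEPT = auto()    # broad notions such as "violence"

@dataclass
class ErasureMethod:
    name: str
    intervention: InterventionLevel
    optimization: OptimizationStructure
    scope: SemanticScope

# Example: placing one method from the related-papers list below.
print(ErasureMethod("SepME", InterventionLevel.UNET_WEIGHTS,
                    OptimizationStructure.FINE_TUNING,
                    SemanticScope.MULTI_CONCEPT))
```

A structure like this makes the survey's claimed benefit explicit: any method becomes a point in a three-dimensional space, so trade-offs between specificity, generalization, and cost can be compared axis by axis.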
Related papers
- When Are Concepts Erased From Diffusion Models? [44.89615668122767]
Concept erasure is the ability to selectively prevent a model from generating specific concepts. We propose two conceptual models for the erasure mechanism in diffusion models. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations.
arXiv Detail & Related papers (2025-05-22T17:59:09Z)
- Erased or Dormant? Rethinking Concept Erasure Through Reversibility [8.454050090398713]
We evaluate two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion. We show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation. Our findings reveal critical limitations in existing concept erasure approaches.
arXiv Detail & Related papers (2025-05-22T03:26:46Z)
- Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion [56.35484513848296]
This research introduces 'continual unlearning', a novel paradigm that enables the targeted removal of multiple specific concepts from foundational generative models. We propose the Decremental Unlearning without Generalization Erosion (DUGE) algorithm, which selectively unlearns the generation of undesired concepts.
arXiv Detail & Related papers (2025-03-17T23:17:16Z)
- Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization [2.163881720692685]
We introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers (CLs) into its architecture. Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model. We evaluate CLs across multiple tasks, demonstrating that they maintain the original model's performance and agreement while enabling meaningful interventions.
arXiv Detail & Related papers (2025-02-19T11:10:19Z)
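As a rough sketch of this projection-and-reconstruction idea, the hypothetical `ConceptLayer` module below maps hidden states into a concept space, allows a simple dictionary-based intervention, and maps them back. The dimensions and the intervention interface are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConceptLayer(nn.Module):
    """Sketch of a concept layer: project hidden states into an
    explainable concept space, optionally intervene on individual
    concept activations, then reconstruct the hidden states."""

    def __init__(self, hidden_dim, num_concepts):
        super().__init__()
        self.to_concepts = nn.Linear(hidden_dim, num_concepts)
        self.from_concepts = nn.Linear(num_concepts, hidden_dim)

    def forward(self, h, edit=None):
        c = self.to_concepts(h)          # interpretable concept activations
        if edit:                         # e.g. {concept_index: new_value}
            for idx, value in edit.items():
                c[..., idx] = value
        return self.from_concepts(c), c  # reconstructed states + concepts

layer = ConceptLayer(hidden_dim=768, num_concepts=32)
h = torch.randn(2, 10, 768)                    # (batch, tokens, hidden)
with torch.no_grad():
    h_out, concepts = layer(h, edit={5: 0.0})  # suppress concept #5
print(h_out.shape, concepts.shape)             # (2, 10, 768) (2, 10, 32)
```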
- A Comprehensive Survey on Concept Erasure in Text-to-Image Diffusion Models [14.325284311928492]
Text-to-Image (T2I) models have made remarkable progress in generating high-quality, diverse visual content from natural language prompts. Their ability to reproduce copyrighted styles, sensitive imagery, and harmful content raises significant ethical and legal concerns. Concept erasure offers a proactive alternative to external filtering by modifying T2I models to prevent the generation of undesired content.
arXiv Detail & Related papers (2025-02-17T20:51:20Z)
- Coding for Intelligence from the Perspective of Category [66.14012258680992]
Coding targets the compression and reconstruction of data; intelligence centers on model learning and prediction.
Recent trends demonstrate the potential homogeneity of these two fields.
We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z)
- Separable Multi-Concept Erasure from Diffusion Models [52.51972530398691]
We propose a Separable Multi-concept Eraser (SepME) to eliminate unsafe concepts from large-scale diffusion models. SepME comprises two parts: generating concept-irrelevant representations and decoupling model weights; the latter separates optimizable model weights so that each weight increment corresponds to a specific concept erasure.
Extensive experiments indicate the efficacy of our approach in eliminating concepts, preserving model performance, and offering flexibility in the erasure or recovery of various concepts.
arXiv Detail & Related papers (2024-02-03T11:10:57Z)
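A minimal sketch of the separable-increment idea: one weight delta per erased concept, composed here by simple addition. The `effective_weight` helper and the additive composition rule are illustrative assumptions, not SepME's exact formulation.

```python
import torch

# Frozen pretrained weight plus one learned increment per erased
# concept; simple summation is an assumed composition rule.
base_weight = torch.randn(4, 4)
increments = {
    "artist_style": torch.randn(4, 4) * 0.01,
    "nudity": torch.randn(4, 4) * 0.01,
}

def effective_weight(erase):
    """Apply only the increments for the concepts to be erased."""
    w = base_weight.clone()
    for concept in erase:
        w = w + increments[concept]
    return w

w_all = effective_weight({"artist_style", "nudity"})  # both erased
w_partial = effective_weight({"nudity"})              # style recovered
print(torch.allclose(w_partial - base_weight, increments["nudity"]))
```

Keeping each increment separate is what makes erasure reversible here: dropping a delta restores the corresponding concept without retraining.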
- How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
- Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present Geom-Erasing, a novel concept removal method based on geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z)
- Coarse-to-Fine Concept Bottleneck Models [9.910980079138206]
This work targets ante hoc interpretability, and specifically Concept Bottleneck Models (CBMs).
Our goal is to design a framework that admits a highly interpretable decision making process with respect to human understandable concepts, on two levels of granularity.
Within this framework, concept information does not solely rely on the similarity between the whole image and general unstructured concepts; instead, we introduce the notion of concept hierarchy to uncover and exploit more granular concept information residing in patch-specific regions of the image scene.
arXiv Detail & Related papers (2023-10-03T14:57:31Z)
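As a toy illustration of scoring concepts at two granularities, the sketch below rates coarse concepts against a whole-image embedding and fine concepts against their best-matching patch. The CLIP-style embedding shapes, cosine similarity, and max-over-patches rule are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn.functional as F

image_feat = torch.randn(512)           # whole-image embedding
patch_feats = torch.randn(49, 512)      # 7x7 grid of patch embeddings
coarse_concepts = torch.randn(10, 512)  # e.g. "bird", "vehicle", ...
fine_concepts = torch.randn(40, 512)    # e.g. "striped wing", "webbed foot"

def cosine(a, b):
    """Cosine similarity between rows of a and rows of b -> (len(a), len(b))."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T

# Coarse level: one score per high-level concept for the whole image.
coarse_scores = cosine(image_feat.unsqueeze(0), coarse_concepts)[0]  # (10,)
# Fine level: each sub-concept is scored by its best-matching patch.
fine_scores = cosine(patch_feats, fine_concepts).max(dim=0).values   # (40,)
print(coarse_scores.shape, fine_scores.shape)
```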
- A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
- A general framework for defining and optimizing robustness [74.67016173858497]
We propose a rigorous and flexible framework for defining different types of robustness properties for classifiers.
Our concept is based on the postulate that the robustness of a classifier should be treated as a property independent of accuracy.
We develop a very general robustness framework that is applicable to any type of classification model.
arXiv Detail & Related papers (2020-06-19T13:24:20Z)
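A minimal sketch of measuring robustness independently of accuracy, in the spirit of this framework: prediction stability under input perturbations. The Gaussian noise model and the `prediction_stability` helper are illustrative assumptions, not the paper's actual definition.

```python
import numpy as np

def prediction_stability(predict, X, eps=0.05, trials=20, seed=0):
    """Fraction of predictions unchanged under small Gaussian input noise.

    Measures robustness as a property of the classifier alone,
    independent of whether its predictions are actually correct."""
    rng = np.random.default_rng(seed)
    base = predict(X)
    agreement = [
        (predict(X + rng.normal(0.0, eps, X.shape)) == base).mean()
        for _ in range(trials)
    ]
    return float(np.mean(agreement))

# A constant classifier has no useful accuracy but perfect stability,
# illustrating that robustness and accuracy are separate properties.
X = np.random.default_rng(1).normal(size=(100, 5))
constant = lambda x: np.zeros(len(x), dtype=int)
print(prediction_stability(constant, X))  # -> 1.0
```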
This list is automatically generated from the titles and abstracts of the papers on this site.