Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation
- URL: http://arxiv.org/abs/2602.05339v1
- Date: Thu, 05 Feb 2026 06:05:24 GMT
- Title: Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation
- Authors: Yongwoo Kim, Sungmin Cha, Hyunsoo Kim, Jaewon Lee, Donghyun Kim,
- Abstract summary: Existing concept erasure approaches focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives. We propose a novel framework, PAIRed Erasing, which reframes concept erasure from simple removal to consistency-preserving semantic realignment. Our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
- Score: 17.59828667571619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
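The abstract's two components can be made concrete with a brief sketch. The first block below illustrates a paired-realignment objective, assuming a trainable denoiser `student`, a frozen copy `teacher`, and precomputed text embeddings for each unsafe-safe prompt pair; every name and the exact loss form are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def paired_realignment_loss(student, teacher, noisy_latents, timesteps,
                            unsafe_emb, safe_emb):
    # Frozen teacher predicts noise under the paired *safe* prompt;
    # this prediction serves as the semantically aligned safe anchor.
    with torch.no_grad():
        safe_anchor = teacher(noisy_latents, timesteps, safe_emb)
    # The student, conditioned on the *unsafe* prompt, is pulled toward
    # the safe anchor, remapping the concept rather than just suppressing it.
    unsafe_pred = student(noisy_latents, timesteps, unsafe_emb)
    return F.mse_loss(unsafe_pred, safe_anchor)
```

The second block sketches one plausible reading of the Fisher-weighted initialization: a diagonal-Fisher proxy (mean squared gradients collected over the unsafe-safe pairs) reweights the mean gradient, and its top singular directions seed the low-rank adapter so it starts aligned with the parameters most implicated in the unsafe concept. The SVD-based construction is an assumption, not necessarily the paper's exact recipe.

```python
import torch

def fisher_weighted_lowrank_init(pair_grads, rank):
    # pair_grads: list of gradient tensors of shape (out, in) for one
    # weight matrix, one gradient per unsafe-safe pair.
    fisher = torch.stack(pair_grads).pow(2).mean(dim=0)  # diagonal Fisher proxy
    g_mean = torch.stack(pair_grads).mean(dim=0)
    # Rank-r factors of the Fisher-weighted mean gradient.
    U, S, Vh = torch.linalg.svd(fisher.sqrt() * g_mean, full_matrices=False)
    A = U[:, :rank] * S[:rank].sqrt()              # (out, r)
    B = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (r, in)
    return A, B  # A @ B approximates the Fisher-weighted gradient
```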
Related papers
- AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models [36.91937453334139]
Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a retention trade-off. This paper introduces Adversarial Erasure with Gradient Informed Synergy (AEGIS), a retention-data-free framework that advances both robustness and retention.
arXiv Detail & Related papers (2026-02-06T15:27:42Z)
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
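The redirection idea admits a minimal sketch: if a lightweight detector flags a prompt embedding as unsafe, blend it toward a precomputed safe anchor before the generator sees it. The detector, blending rule, and all names here are assumptions, not SafeRedir's actual API.

```python
def redirect_embedding(prompt_emb, safe_anchor, unsafe_score, threshold=0.5):
    # Benign prompts pass through untouched.
    if unsafe_score < threshold:
        return prompt_emb
    # Redirect strength grows with the detector's confidence.
    w = min(1.0, float(unsafe_score))
    return (1.0 - w) * prompt_emb + w * safe_anchor
```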
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
- Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models [32.35244979539898]
Concept erasure has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image models. We propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously.
arXiv Detail & Related papers (2025-12-15T07:08:35Z)
- SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge [51.634837361795434]
SafeR-CLIP reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods. We also contribute NSFW-Caps, a new benchmark of 1,000 highly aligned pairs for testing safety under distributional shift.
arXiv Detail & Related papers (2025-11-20T19:00:15Z)
- CGCE: Classifier-Guided Concept Erasure in Generative Models [53.7410000675294]
Concept erasure has been developed to remove undesirable concepts from pre-trained models. Existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. We introduce an efficient plug-and-play framework that provides robust concept erasure for diverse generative models.
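As a rough illustration of the classifier-guided idea, a concept classifier's gradient can steer each denoising step away from the erased concept without touching the generator's weights. This is a generic classifier-guidance sketch, not CGCE's exact update rule; `concept_classifier` is a hypothetical module scoring how strongly the concept is present.

```python
import torch

def classifier_guided_step(eps_pred, latents, concept_classifier, scale=1.0):
    latents = latents.detach().requires_grad_(True)
    # Higher score = more of the erased concept present in the latents.
    score = concept_classifier(latents).sum()
    grad = torch.autograd.grad(score, latents)[0]
    # Nudge the noise prediction so sampling moves away from the concept.
    return eps_pred + scale * grad
```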
arXiv Detail & Related papers (2025-11-08T05:38:18Z)
- SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing [65.82241040239452]
Concept erasing finetunes weights to unlearn undesirable concepts. Existing methods treat an unsafe concept as a fixed word and repeatedly erase it. We introduce semantic-augment erasing, which transforms concept-word erasure into concept-domain erasure.
arXiv Detail & Related papers (2025-06-11T03:21:24Z)
- CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models [7.68494752148263]
CURE is a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models. The Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. CURE achieves more efficient and thorough removal of targeted artistic styles, objects, identities, or explicit content.
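One way to picture orthogonal representation editing, under the assumption that the edit acts on a linear layer: span a subspace with the undesired concept's embeddings and project the weight so it no longer responds to those directions. This is an illustrative reading, not CURE's exact Spectral Eraser.

```python
import torch

def spectral_erase(weight, concept_embs, energy=0.99):
    # Dominant directions of the concept embeddings (shape (n, d)).
    U, S, _ = torch.linalg.svd(concept_embs.T, full_matrices=False)
    # Keep enough directions to cover the requested spectral mass.
    k = int(torch.searchsorted(S.cumsum(0) / S.sum(), energy)) + 1
    basis = U[:, :k]  # (d, k) concept subspace
    # Project the weight's input space orthogonally to that subspace.
    proj = torch.eye(weight.shape[1]) - basis @ basis.T
    return weight @ proj
```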
arXiv Detail & Related papers (2025-05-19T03:53:06Z)
- Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can expose users to harmful, unsafe, controversial, or culturally inappropriate outputs. We leverage safe embeddings and a modified diffusion process with weighted, tunable summation in the latent space to generate safer images. We identify trade-offs between safety and censorship, a necessary perspective in the development of ethical AI models.
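The weighted summation itself reduces to a one-line blend of the original and safely-conditioned latents at each step; the weight schedule and the source of the safe latent are the interesting parts, and this sketch fixes both by assumption.

```python
def dual_latent_blend(z_orig, z_safe, w=0.3):
    # z_orig / z_safe: latents from the unmodified and safety-conditioned
    # diffusion passes at the same timestep (assumed matching shapes).
    # Small w preserves the original context; large w favors safety.
    return (1.0 - w) * z_orig + w * z_safe
```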
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts that lead diffusion models to generate inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts originally regarded as safe so that they evade existing safety mechanisms.
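The core trick can be sketched as extracting an empirical concept direction from paired prompts and adding it to a benign prompt's embedding; Ring-A-Bell additionally inverts the result back to discrete tokens, which this sketch omits. `embed` is an assumed text-encoder wrapper.

```python
import torch

def concept_vector(embed, unsafe_prompts, safe_prompts):
    # Average embedding difference between prompts with and without
    # the target concept approximates a "concept direction".
    diffs = [embed(u) - embed(s) for u, s in zip(unsafe_prompts, safe_prompts)]
    return torch.stack(diffs).mean(dim=0)

# Candidate adversarial embedding (eta controls attack strength):
# adv_emb = embed("a person at the beach") + eta * concept_vector(...)
```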
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.