Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts
- URL: http://arxiv.org/abs/2504.12782v1
- Date: Thu, 17 Apr 2025 09:29:30 GMT
- Title: Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts
- Authors: Leyang Li, Shilin Lu, Yan Ren, Adams Wai-Kin Kong
- Abstract summary: We introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages. For single-concept erasure, we propose an augmentation-enhanced weight saliency map, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance.
- Score: 12.04985139116705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
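The mechanism the abstract hinges on, reversing the classifier-free guidance (CFG) condition direction in mid-to-late denoising steps, can be pictured with a short sketch. The code below is not the authors' released implementation; the `unet` callable, the prompt embeddings, and the 0.5 switch fraction are hypothetical placeholders used only to illustrate the guidance geometry.

```python
# Minimal sketch of CFG with a condition-direction flip in mid-to-late steps.
# Hypothetical interface: `unet(x_t, t, emb)` returns a noise prediction.
import torch


def flipped_cfg_noise(unet, x_t: torch.Tensor, t: torch.Tensor,
                      step_idx: int, total_steps: int,
                      cond_emb: torch.Tensor, uncond_emb: torch.Tensor,
                      guidance_scale: float = 7.5,
                      switch_frac: float = 0.5) -> torch.Tensor:
    """One guided noise prediction; `cond_emb` encodes the unwanted concept."""
    eps_uncond = unet(x_t, t, uncond_emb)      # unconditional score estimate
    eps_cond = unet(x_t, t, cond_emb)          # conditioned on the unwanted concept
    direction = eps_cond - eps_uncond          # points toward the concept

    if step_idx < switch_frac * total_steps:
        # Early stage: ordinary CFG, leaving the score-function field that
        # steers samples toward the natural-image manifold untouched.
        return eps_uncond + guidance_scale * direction
    # Mid-to-late stage: negate the conditional direction so the denoising
    # trajectory is pushed away from the unwanted concept.
    return eps_uncond - guidance_scale * direction
```

According to the abstract, ANT uses this behaviour to motivate a trajectory-aware finetuning objective rather than as an inference-time trick, so the sketch only illustrates the insight, not the training procedure.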
Related papers
- Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion [56.35484513848296]
This research introduces 'continual unlearning', a novel paradigm that enables the targeted removal of multiple specific concepts from foundational generative models.
We propose the Decremental Unlearning without Generalization Erosion (DUGE) algorithm, which selectively unlearns the generation of undesired concepts.
arXiv Detail & Related papers (2025-03-17T23:17:16Z)
- Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models [24.15603438969762]
Interpret then Deactivate (ItD) is a novel framework to enable precise concept removal in T2I diffusion models. ItD uses a sparse autoencoder to interpret each concept as a combination of multiple features. It can be easily extended to erase multiple concepts without requiring further training.
arXiv Detail & Related papers (2025-03-12T14:46:40Z)
- SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models [41.284399182295026]
We introduce SPEED, a model editing-based concept erasure approach that leverages null-space constraints for scalable, precise, and efficient erasure.
SPEED consistently outperforms existing methods in prior preservation while achieving efficient and high-fidelity concept erasure.
arXiv Detail & Related papers (2025-03-10T14:40:01Z)
- Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations [10.86252546314626]
Text-to-image generative models are prone to adversarial attacks and can inadvertently generate unsafe, unethical content. We propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation (a generic k-sparse autoencoder sketch appears after this list). Our method yields an improvement of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim 5\times}$ faster than the current state-of-the-art.
arXiv Detail & Related papers (2025-01-31T11:52:47Z)
- DuMo: Dual Encoder Modulation Network for Precise Concept Erasure [75.05165577219425]
We propose our Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimal impairment to non-target concepts. Our method achieves state-of-the-art performance on Explicit Content Erasure, Cartoon Concept Removal, and Artistic Style Erasure, clearly outperforming alternative methods.
arXiv Detail & Related papers (2025-01-02T07:47:34Z)
- EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers [33.195628798316754]
EraseAnything is the first method specifically developed to address concept erasure within the latest flow-based T2I framework.
We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer.
We propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones.
arXiv Detail & Related papers (2024-12-29T09:42:53Z)
- AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors [61.007590285263376]
Security concerns have driven researchers to unlearn inappropriate concepts through fine-tuning. Recent fine-tuning methods exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts. We propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue.
arXiv Detail & Related papers (2024-12-28T04:44:07Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
- Separable Multi-Concept Erasure from Diffusion Models [52.51972530398691]
We propose a Separable Multi-concept Eraser (SepME) to eliminate unsafe concepts from large-scale diffusion models.
SepME separates optimizable model weights, making each weight increment correspond to the erasure of a specific concept.
Extensive experiments indicate the efficacy of our approach in eliminating concepts, preserving model performance, and offering flexibility in the erasure or recovery of various concepts.
arXiv Detail & Related papers (2024-02-03T11:10:57Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts that lead diffusion models to generate inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts originally regarded as safe into ones that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
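As referenced in the Concept Steerers summary above, a k-sparse autoencoder keeps only the k largest latent activations, which is what makes its features amenable to interpretation and manipulation. The toy sketch below is a generic k-SAE, not that paper's implementation; the layer sizes and the idea of zeroing a latent unit to suppress a concept are illustrative assumptions.

```python
# Generic k-sparse autoencoder (toy): only the top-k latent activations
# survive, so each surviving unit can be read as a candidate concept feature.
import torch
import torch.nn as nn


class KSparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.k = k

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        return z * mask                      # keep top-k activations, zero the rest

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))


# Illustrative manipulation: zero out one latent unit before decoding to
# suppress whatever feature that unit represents.
sae = KSparseAutoencoder(d_model=768, d_hidden=4096, k=32)
z = sae.encode(torch.randn(1, 768))
z[:, 123] = 0.0                              # 123 is an arbitrary example index
suppressed = sae.decoder(z)
```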