Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
- URL: http://arxiv.org/abs/2509.22400v1
- Date: Fri, 26 Sep 2025 14:26:52 GMT
- Title: Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
- Authors: Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Ke Xu
- Abstract summary: We propose a novel framework VARE that enables stable concept erasure in visual autoregressive models.
We then introduce S-VARE, a novel and effective concept erasure method designed for VAR.
Our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap in autoregressive text-to-image generation.
- Score: 48.34555526275907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose a novel VAR Erasure framework, VARE, that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross-entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left by earlier methods in autoregressive text-to-image generation.
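The abstract names two loss terms (a filtered cross-entropy restricted to unsafe visual tokens and a preservation term) without giving formulas. A minimal NumPy sketch of how such an objective could be combined, assuming per-position logits over a visual token vocabulary, a frozen reference model, and a precomputed unsafe-token mask (all names and shapes here are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def filtered_erasure_loss(logits, ref_logits, safe_targets, unsafe_mask, lam=1.0):
    """Toy erasure objective: filtered cross-entropy on flagged unsafe
    token positions plus a KL preservation term on the remaining ones.

    logits:       (T, V) current model logits over the visual vocabulary
    ref_logits:   (T, V) logits from a frozen reference copy of the model
    safe_targets: (T,)   replacement token ids for unsafe positions
    unsafe_mask:  (T,)   bool, True where a visual token is flagged unsafe
    """
    probs = softmax(logits)
    unsafe = np.flatnonzero(unsafe_mask)
    safe = np.flatnonzero(~unsafe_mask)

    # Filtered cross-entropy: only the flagged unsafe positions are pushed
    # toward their replacement tokens; safe positions contribute nothing.
    erase = 0.0
    if unsafe.size:
        erase = -np.mean(np.log(probs[unsafe, safe_targets[unsafe]] + 1e-12))

    # Preservation: keep safe positions close to the frozen reference
    # distribution via KL(ref || current), limiting language drift.
    preserve = 0.0
    if safe.size:
        ref = softmax(ref_logits[safe])
        cur = probs[safe]
        preserve = np.mean(
            (ref * (np.log(ref + 1e-12) - np.log(cur + 1e-12))).sum(-1)
        )

    return erase + lam * preserve
```

The key property being illustrated is the "surgical" filtering: gradients flow only through positions the mask marks as unsafe, while the KL term anchors everything else to the original model.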
Related papers
- CGCE: Classifier-Guided Concept Erasure in Generative Models [53.7410000675294]
Concept erasure has been developed to remove undesirable concepts from pre-trained models.
Existing methods remain vulnerable to adversarial attacks that can regenerate the erased content.
We introduce an efficient plug-and-play framework that provides robust concept erasure for diverse generative models.
arXiv Detail & Related papers (2025-11-08T05:38:18Z)
- Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models [27.672305802461377]
We introduce a novel training-free, zero-shot framework for concept erasure that operates directly on text embeddings before the diffusion process.
We achieve superior completeness and robustness while preserving locality and image quality.
This robustness also allows our framework to function as a built-in threat detection system, offering a practical solution for safer text-to-image generation.
arXiv Detail & Related papers (2025-10-26T22:04:17Z)
- VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation [57.36681904639463]
Methods to safeguard autoregressive text-to-image models remain underexplored.
We propose Visual Contrast Exploitation (VCE), a novel framework that precisely decouples unsafe concepts from their associated content semantics.
Our experiments demonstrate that our method effectively secures the model, achieving state-of-the-art results while erasing unsafe concepts and maintaining the integrity of unrelated safe concepts.
arXiv Detail & Related papers (2025-09-21T09:00:27Z)
- Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness [4.23067546195708]
SCORE (Secure and Concept-Oriented Robust Erasure) is a novel framework for robust concept removal in diffusion models.
SCORE sets a new standard for secure and robust concept erasure in diffusion models.
arXiv Detail & Related papers (2025-09-15T15:05:50Z)
- FADE: Adversarial Concept Erasure in Flow Models [4.774890908509861]
We propose a novel concept erasure method for text-to-image diffusion models.
Our method combines a trajectory-aware fine-tuning strategy with an adversarial objective to ensure the concept is reliably removed.
We prove a formal guarantee that our approach minimizes the mutual information between the erased concept and the model's outputs.
arXiv Detail & Related papers (2025-07-16T14:31:21Z)
- TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models [53.937498564603054]
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images.
To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts.
We propose TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.
arXiv Detail & Related papers (2025-03-10T14:37:53Z)
- Rethinking the Vulnerability of Concept Erasure and a New Method [9.044763606650646]
Concept erasure (defense) methods have been developed to "unlearn" specific concepts through post-hoc finetuning.
Recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts.
We introduce RECORD, a novel coordinate-descent-based restoration algorithm that consistently outperforms existing restoration methods by up to 17.8 times.
arXiv Detail & Related papers (2025-02-24T17:26:01Z)
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.
We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components.
This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
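The RECE summary describes a closed-form edit (no fine-tuning) that aligns unsafe embeddings with harmless ones inside cross-attention projections. RECE's actual update rule is not given here; a generic NumPy sketch of the underlying idea, a minimal-norm rank-one weight edit that remaps one input embedding to the output the original weights produce for a harmless one, might look like:

```python
import numpy as np

def rank_one_align(W, e_unsafe, e_safe):
    """Minimal Frobenius-norm rank-one edit of a projection matrix so that
    the unsafe embedding is mapped to the output the ORIGINAL weights gave
    the harmless embedding, i.e. W' @ e_unsafe == W @ e_safe.

    W:        (d_out, d_in) cross-attention projection weights
    e_unsafe: (d_in,) embedding representing the concept to erase
    e_safe:   (d_in,) embedding of a harmless replacement concept
    """
    target = W @ e_safe
    # Rank-one correction along e_unsafe; inputs orthogonal to e_unsafe
    # are unaffected, which is why the edit is cheap and localized.
    delta = np.outer(target - W @ e_unsafe, e_unsafe) / (e_unsafe @ e_unsafe)
    return W + delta
```

Because the correction is rank-one and computed in closed form, an edit like this runs in milliseconds per layer, which is consistent with the "3 seconds, no fine-tuning" claim in the abstract, though the paper's exact derivation may differ.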
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
- Pruning for Robust Concept Erasing in Diffusion Models [27.67237515704348]
We introduce a new pruning-based strategy for concept erasing.
Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons.
Experimental results show a significant enhancement in our model's ability to resist adversarial inputs.
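The pruning summary above leaves the selection criterion unspecified. A hedged NumPy sketch of the general recipe, zeroing the small fraction of weights with the highest concept-sensitivity score, where how that score is computed (e.g. gradient magnitude of a concept loss) is an assumption:

```python
import numpy as np

def prune_most_sensitive(weights, sensitivity, prune_frac=0.02):
    """Zero the fraction of weights whose sensitivity score for the target
    concept is highest; all other weights are left untouched.

    weights:     parameter array of any shape
    sensitivity: same shape, higher = more tied to the erased concept
                 (how this score is obtained is an assumption here)
    prune_frac:  fraction of weights to remove
    """
    flat = weights.ravel().copy()
    order = np.argsort(sensitivity.ravel())   # ascending sensitivity
    k = max(1, int(prune_frac * flat.size))
    flat[order[-k:]] = 0.0                    # drop the top-k sensitive weights
    return flat.reshape(weights.shape)
```

Hard-zeroing concept-related neurons, rather than nudging them with a loss, is what gives pruning-based erasure its robustness to adversarial prompt attacks: there is no residual weight for a crafted prompt to reactivate.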
arXiv Detail & Related papers (2024-05-26T11:42:20Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.