Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
- URL: http://arxiv.org/abs/2509.22400v1
- Date: Fri, 26 Sep 2025 14:26:52 GMT
- Title: Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
- Authors: Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Ke Xu
- Abstract summary: We propose a novel framework VARE that enables stable concept erasure in visual autoregressive models.
We then introduce S-VARE, a novel and effective concept erasure method designed for VAR.
Our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap in autoregressive text-to-image generation.
- Score: 48.34555526275907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose a novel VAR Erasure framework, VARE, that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross-entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left by earlier methods in autoregressive text-to-image generation.
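The abstract names two loss terms (a filtered cross-entropy restricted to unsafe visual tokens and a preservation term) without giving formulas. A minimal NumPy sketch of how such an objective could be combined, assuming per-position logits over a visual token vocabulary, a frozen reference model, and a precomputed unsafe-token mask (all names and shapes here are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def filtered_erasure_loss(logits, ref_logits, safe_targets, unsafe_mask, lam=1.0):
    """Toy erasure objective: filtered cross-entropy on flagged unsafe
    token positions plus a KL preservation term on the remaining ones.

    logits:       (T, V) current model logits over the visual vocabulary
    ref_logits:   (T, V) logits from a frozen reference copy of the model
    safe_targets: (T,)   replacement token ids for unsafe positions
    unsafe_mask:  (T,)   bool, True where a visual token is flagged unsafe
    """
    probs = softmax(logits)
    unsafe = np.flatnonzero(unsafe_mask)
    safe = np.flatnonzero(~unsafe_mask)

    # Filtered cross-entropy: only the flagged unsafe positions are pushed
    # toward their replacement tokens; safe positions contribute nothing.
    erase = 0.0
    if unsafe.size:
        erase = -np.mean(np.log(probs[unsafe, safe_targets[unsafe]] + 1e-12))

    # Preservation: keep safe positions close to the frozen reference
    # distribution via KL(ref || current), limiting language drift.
    preserve = 0.0
    if safe.size:
        ref = softmax(ref_logits[safe])
        cur = probs[safe]
        preserve = np.mean(
            (ref * (np.log(ref + 1e-12) - np.log(cur + 1e-12))).sum(-1)
        )

    return erase + lam * preserve
```

The key property being illustrated is the "surgical" filtering: gradients flow only through positions the mask marks as unsafe, while the KL term anchors everything else to the original model.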
Related papers
- CGCE: Classifier-Guided Concept Erasure in Generative Models [53.7410000675294]
Concept erasure has been developed to remove undesirable concepts from pre-trained models.
Existing methods remain vulnerable to adversarial attacks that can regenerate the erased content.
We introduce an efficient plug-and-play framework that provides robust concept erasure for diverse generative models.
arXiv Detail & Related papers (2025-11-08T05:38:18Z)
- Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models [27.672305802461377]
We introduce a novel training-free, zero-shot framework for concept erasure that operates directly on text embeddings before the diffusion process.
We achieve superior completeness and robustness while preserving locality and image quality.
This robustness also allows our framework to function as a built-in threat detection system, offering a practical solution for safer text-to-image generation.
arXiv Detail & Related papers (2025-10-26T22:04:17Z)
- VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation [57.36681904639463]
Methods to safeguard autoregressive text-to-image models remain underexplored.
We propose Visual Contrast Exploitation (VCE), a novel framework that precisely decouples unsafe concepts from their associated content semantics.
Our experiments demonstrate that our method effectively secures the model, achieving state-of-the-art results while erasing unsafe concepts and maintaining the integrity of unrelated safe concepts.
arXiv Detail & Related papers (2025-09-21T09:00:27Z)
- Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness [4.23067546195708]
SCORE (Secure and Concept-Oriented Robust Erasure) is a novel framework for robust concept removal in diffusion models.
SCORE sets a new standard for secure and robust concept erasure in diffusion models.
arXiv Detail & Related papers (2025-09-15T15:05:50Z)
- FADE: Adversarial Concept Erasure in Flow Models [4.774890908509861]
We propose a novel concept erasure method for text-to-image diffusion models.
Our method combines a trajectory-aware fine-tuning strategy with an adversarial objective to ensure the concept is reliably removed.
We prove a formal guarantee that our approach minimizes the mutual information between the erased concept and the model's outputs.
arXiv Detail & Related papers (2025-07-16T14:31:21Z)
- TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models [53.937498564603054]
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images.
To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts.
We propose TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.
arXiv Detail & Related papers (2025-03-10T14:37:53Z)
- Rethinking the Vulnerability of Concept Erasure and a New Method [9.044763606650646]
Concept erasure (defense) methods have been developed to "unlearn" specific concepts through post-hoc finetuning.
Recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts.
We introduce RECORD, a novel coordinate-descent-based restoration algorithm that consistently outperforms existing restoration methods by up to 17.8 times.
arXiv Detail & Related papers (2025-02-24T17:26:01Z)
- Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.
We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components.
This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
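The RECE summary describes a closed-form edit (no fine-tuning) that aligns unsafe embeddings with harmless ones inside cross-attention projections. RECE's actual update rule is not given here; a generic NumPy sketch of the underlying idea, a minimal-norm rank-one weight edit that remaps one input embedding to the output the original weights produce for a harmless one, might look like:

```python
import numpy as np

def rank_one_align(W, e_unsafe, e_safe):
    """Minimal Frobenius-norm rank-one edit of a projection matrix so that
    the unsafe embedding is mapped to the output the ORIGINAL weights gave
    the harmless embedding, i.e. W' @ e_unsafe == W @ e_safe.

    W:        (d_out, d_in) cross-attention projection weights
    e_unsafe: (d_in,) embedding representing the concept to erase
    e_safe:   (d_in,) embedding of a harmless replacement concept
    """
    target = W @ e_safe
    # Rank-one correction along e_unsafe; inputs orthogonal to e_unsafe
    # are unaffected, which is why the edit is cheap and localized.
    delta = np.outer(target - W @ e_unsafe, e_unsafe) / (e_unsafe @ e_unsafe)
    return W + delta
```

Because the correction is rank-one and computed in closed form, an edit like this runs in milliseconds per layer, which is consistent with the "3 seconds, no fine-tuning" claim in the abstract, though the paper's exact derivation may differ.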
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
- Pruning for Robust Concept Erasing in Diffusion Models [27.67237515704348]
We introduce a new pruning-based strategy for concept erasing.
Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons.
Experimental results show a significant enhancement in our model's ability to resist adversarial inputs.
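The pruning summary above leaves the selection criterion unspecified. A hedged NumPy sketch of the general recipe, zeroing the small fraction of weights with the highest concept-sensitivity score, where how that score is computed (e.g. gradient magnitude of a concept loss) is an assumption:

```python
import numpy as np

def prune_most_sensitive(weights, sensitivity, prune_frac=0.02):
    """Zero the fraction of weights whose sensitivity score for the target
    concept is highest; all other weights are left untouched.

    weights:     parameter array of any shape
    sensitivity: same shape, higher = more tied to the erased concept
                 (how this score is obtained is an assumption here)
    prune_frac:  fraction of weights to remove
    """
    flat = weights.ravel().copy()
    order = np.argsort(sensitivity.ravel())   # ascending sensitivity
    k = max(1, int(prune_frac * flat.size))
    flat[order[-k:]] = 0.0                    # drop the top-k sensitive weights
    return flat.reshape(weights.shape)
```

Hard-zeroing concept-related neurons, rather than nudging them with a loss, is what gives pruning-based erasure its robustness to adversarial prompt attacks: there is no residual weight for a crafted prompt to reactivate.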
arXiv Detail & Related papers (2024-05-26T11:42:20Z)
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.