The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization
- URL: http://arxiv.org/abs/2602.00175v1
- Date: Fri, 30 Jan 2026 02:39:51 GMT
- Title: The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization
- Authors: Manyi Li, Yufan Liu, Lai Jiang, Bing Li, Yuming Li, Weiming Hu
- Abstract summary: Unlearning-based defenses claim to purge Not-Safe-For-Work concepts from diffusion models (DMs). We show that unlearning partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memories. We propose IVO, a concise and powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings.
- Score: 51.835894707552946
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although unlearning-based defenses claim to purge Not-Safe-For-Work (NSFW) concepts from diffusion models (DMs), we reveal that this "forgetting" is largely an illusion. Unlearning partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memories. We find that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, also reflecting the strength of unlearning. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a concise and powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings. Through Image Inversion, Adversarial Optimization, and Reused Attack, IVO optimizes initial latent variables to realign the noise distribution of unlearned models with their original unsafe states. Extensive experiments across 8 widely used unlearning techniques demonstrate that IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses. The code is available at anonymous.4open.science/r/IVO/. Warning: This paper has unsafe images that may offend some readers.
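The core optimization the abstract describes — adjusting an initial latent so the unlearned model's noise prediction realigns with the original model's — can be sketched as a minimal NumPy toy. The linear maps `W_unlearned` and `W_original` below are hypothetical stand-ins for the two models' noise predictors (the real attack operates on U-Net outputs), not the paper's actual implementation.

```python
import numpy as np

# Hypothetical toy denoisers: each maps an initial latent z to a predicted
# noise vector. In the real attack these would be U-Net noise predictions
# from the unlearned and original diffusion models.
W_unlearned = np.array([[0.9, 0.1], [0.0, 1.1]])
W_original = np.array([[1.0, 0.0], [0.0, 1.0]])

def noise_pred(W, z):
    return W @ z

def ivo_step(z, target, lr=0.1):
    """One gradient step pulling the unlearned model's noise prediction
    toward the original model's prediction (loss = 0.5 * ||Wz - target||^2)."""
    residual = noise_pred(W_unlearned, z) - target
    grad = W_unlearned.T @ residual  # analytic gradient of the quadratic loss
    return z - lr * grad

z = np.array([1.0, -1.0])            # initial latent (e.g. from image inversion)
target = noise_pred(W_original, z)   # original model's noise prediction at z
for _ in range(200):
    z = ivo_step(z, target)

# After optimization the two noise predictions nearly coincide.
print(np.linalg.norm(noise_pred(W_unlearned, z) - target))
```

In the toy setup the quadratic loss is convex, so gradient descent drives the residual to (numerically) zero; the paper's setting replaces the linear maps with nonlinear denoisers but keeps the same latent-space optimization structure.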
Related papers
- Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation [81.40978077888693]
Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations.
arXiv Detail & Related papers (2026-03-05T04:45:49Z) - ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models [12.021923446217722]
Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations. We introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem.
arXiv Detail & Related papers (2026-01-30T21:56:50Z) - Deep Leakage with Generative Flow Matching Denoiser [54.05993847488204]
We introduce a new deep leakage (DL) attack that integrates a generative Flow Matching (FM) prior into the reconstruction process. Our approach consistently outperforms state-of-the-art attacks across pixel-level, perceptual, and feature-based similarity metrics.
arXiv Detail & Related papers (2026-01-21T14:51:01Z) - Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations [18.024767641200064]
We propose a model-based perturbation strategy that operates within the latent space of diffusion models. Our method alternates between denoising and inversion while modifying the starting point of the denoising trajectory. We validate our approach on four benchmark datasets to demonstrate robustness against state-of-the-art inversion attacks.
arXiv Detail & Related papers (2025-10-03T15:18:45Z) - Active Adversarial Noise Suppression for Image Forgery Localization [56.98050814363447]
We introduce an Adversarial Noise Suppression Module (ANSM) that generates a defensive perturbation to suppress the attack effect of adversarial noise. To the best of our knowledge, this is the first report of adversarial defense in image forgery localization tasks.
arXiv Detail & Related papers (2025-06-15T14:53:27Z) - Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework finetunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our framework reduces attack success rates to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z) - CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models [7.68494752148263]
CURE is a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models. The Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. CURE achieves a more efficient and thorough removal of targeted artistic styles, objects, identities, or explicit content.
arXiv Detail & Related papers (2025-05-19T03:53:06Z) - TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models [53.937498564603054]
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods are studied to help the model unlearn specific concepts. We propose TRCE, a two-stage concept erasure strategy that achieves an effective trade-off between reliable erasure and knowledge preservation.
arXiv Detail & Related papers (2025-03-10T14:37:53Z) - Improving LLM Unlearning Robustness via Random Perturbations [9.075604660200053]
We show that current state-of-the-art LLM unlearning methods inherently reduce models' robustness. We propose a novel theoretical framework that reframes the unlearning process as backdoor attacks and defenses.
arXiv Detail & Related papers (2025-01-31T15:12:20Z) - Diffusion Models for Adversarial Purification [69.1882221038846]
Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative model.
We propose DiffPure that uses diffusion models for adversarial purification.
Our method achieves state-of-the-art results, outperforming current adversarial training and adversarial purification methods.
arXiv Detail & Related papers (2022-05-16T06:03:00Z)
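The purification idea in the DiffPure entry above — diffuse the adversarial input forward with noise, then run a reverse denoising process to pull it back to the data manifold — can be illustrated with a toy NumPy sketch. The shrink-toward-mean step below is a hypothetical stand-in for a learned reverse diffusion model, and `mu` plays the role of the (here, known) clean data distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: "clean" data lives near mu; an adversarial example sits off it.
mu = np.zeros(8)
x_adv = mu + 0.5 * rng.standard_normal(8)  # hypothetical perturbed input

def purify(x, t=0.5, steps=20):
    """Forward-diffuse x with noise level t, then iteratively denoise by
    shrinking toward the known data mean - a toy stand-in for the learned
    reverse SDE of a diffusion model."""
    x_t = np.sqrt(1 - t) * x + np.sqrt(t) * rng.standard_normal(x.shape)
    for _ in range(steps):
        x_t = x_t + 0.3 * (mu - x_t)  # toy score step toward the data manifold
    return x_t

x_pure = purify(x_adv)
# The purified point lands closer to the clean distribution than the input.
print(np.linalg.norm(x_pure - mu), np.linalg.norm(x_adv - mu))
```

The forward diffusion deliberately drowns the adversarial perturbation in Gaussian noise; the reverse process then recovers a sample near the clean distribution, which is the mechanism DiffPure exploits with a real pretrained diffusion model.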
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.