ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models
- URL: http://arxiv.org/abs/2602.00350v1
- Date: Fri, 30 Jan 2026 21:56:50 GMT
- Title: ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models
- Authors: Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda, Przemysław Spurek,
- Abstract summary: Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models.<n>Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations.<n>We introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem.
- Score: 12.021923446217722
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search. At the same time, reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe
Related papers
- Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation [81.40978077888693]
Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance.<n>Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens.<n>We integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations.
arXiv Detail & Related papers (2026-03-05T04:45:49Z) - Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models [7.17300076441681]
SurgUn is a surgical unlearning method that applies targeted weight-space updates to remove specific visual concepts in text-conditioned diffusion models.<n>Our approach is motivated by retroactive interference theory, which holds that newly acquired memories can overwrite, suppress, or impede access to prior ones.<n>We adapt this principle to diffusion models by inducing retroactive concept interference, enabling focused destabilization of only the target concept.
arXiv Detail & Related papers (2026-03-01T08:07:14Z) - Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion [0.0]
Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility.<n>We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process.<n>Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones.
arXiv Detail & Related papers (2026-01-06T17:52:02Z) - Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework.<n>We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding.<n>Through context-isolated flow-matching pretraining, our approach achieves strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z) - Revoking Amnesia: RL-based Trajectory Optimization to Resurrect Erased Concepts in Diffusion Models [38.38751366738881]
Concept erasure techniques have been widely deployed in T2I diffusion models to prevent inappropriate content generation for safety and copyright considerations.<n> established erasure methods exhibit degraded effectiveness, raising questions about their true mechanisms.<n>We propose textbfRevAm, a trajectory optimization framework that resurrects erased concepts by dynamically steering the denoising process.
arXiv Detail & Related papers (2025-09-30T07:46:19Z) - LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling [38.700993166492495]
We propose a dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model.<n>Our method incorporates the multimodal understanding model to provide sematic priors for the generative model under a task-blind condition.
arXiv Detail & Related papers (2025-07-01T14:25:09Z) - Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning [65.85335291827086]
This paper tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints.<n>We pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos.<n>For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model.
arXiv Detail & Related papers (2025-03-11T13:50:22Z) - Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models [7.9993879763024065]
This paper presents a theoretical and empirical examination of five commonly used techniques for unlearning in diffusion models.<n>We introduce two new evaluation metrics: Concept Retrieval Score (textbfCRS) and Concept Confidence Score (textbfCCS)
arXiv Detail & Related papers (2024-09-09T14:38:31Z) - Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adrial robustness has been conventionally believed as a challenging property to encode for neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z) - Rethinking and Defending Protective Perturbation in Personalized Diffusion Models [21.30373461975769]
We study the fine-tuning process of personalized diffusion models (PDMs) through the lens of shortcut learning.
PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets.
We propose a systematic defense framework that includes data purification and contrastive decoupling learning.
arXiv Detail & Related papers (2024-06-27T07:14:14Z) - Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient [20.698305103879232]
We propose a novel concept domain correction framework named textbfDoCo (textbfDomaintextbfCorrection)<n>By aligning the output domains of sensitive and anchor concepts through adversarial training, our approach ensures comprehensive unlearning of target concepts.<n>We also introduce a concept-preserving gradient surgery technique that mitigates conflicting gradient components, thereby preserving the model's utility while unlearning specific concepts.
arXiv Detail & Related papers (2024-05-24T07:47:36Z) - Model Will Tell: Training Membership Inference for Diffusion Models [15.16244745642374]
Training Membership Inference (TMI) task aims to determine whether a specific sample has been used in the training process of a target model.
In this paper, we explore a novel perspective for the TMI task by leveraging the intrinsic generative priors within the diffusion model.
arXiv Detail & Related papers (2024-03-13T12:52:37Z) - Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey [73.86861112002593]
We present a comprehensive review of recent diffusion model-based methods on image restoration.<n>We classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR.<n>We propose five potential and challenging directions for the future research of diffusion model-based IR.
arXiv Detail & Related papers (2023-08-18T08:40:38Z) - Exploiting Diffusion Prior for Real-World Image Super-Resolution [75.5898357277047]
We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution.
By employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model.
arXiv Detail & Related papers (2023-05-11T17:55:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.