Related papers: ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

URL: http://arxiv.org/abs/2602.00350v1
Date: Fri, 30 Jan 2026 21:56:50 GMT
Title: ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models
Authors: Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda, Przemysław Spurek
Abstract summary: Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations. We introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem.
Score: 12.021923446217722
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search. At the same time, reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe
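
Illustrative sketch (not taken from the paper or its repository): the abstract describes using the diffusion model's noise prediction loss as a verifiable reward for a prompt-generating agent. The toy Python below shows one way such a reward could be computed, under loud assumptions. `toy_predict_noise` is a hypothetical stand-in for a real denoiser, the "conditioning" is just a vector in the same space as the sample, and `alpha_bar` is drawn uniformly rather than from a real noise schedule. The actual ReLAPSe reward and training loop may differ.

```python
import math
import random


def toy_predict_noise(x_t, alpha_bar, concept):
    """Hypothetical denoiser: assumes the clean sample equals `concept`
    and solves x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps for eps."""
    scale = math.sqrt(1.0 - alpha_bar)
    return [(xt - math.sqrt(alpha_bar) * c) / scale for xt, c in zip(x_t, concept)]


def noise_prediction_reward(predict_noise, x0, concept, rng, num_samples=8):
    """Reward = negative mean squared noise-prediction error, averaged over
    randomly drawn noise levels. Higher (closer to zero) means the
    conditioning explains the target sample better."""
    losses = []
    for _ in range(num_samples):
        alpha_bar = rng.uniform(0.05, 0.95)  # assumed stand-in for a noise schedule
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        # Forward diffusion: mix the clean sample with Gaussian noise.
        x_t = [
            math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(x0, eps)
        ]
        eps_hat = predict_noise(x_t, alpha_bar, concept)
        losses.append(sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(x0))
    return -sum(losses) / len(losses)


# Toy comparison: a conditioning that matches the target versus one that does not.
data_rng = random.Random(0)
target = [data_rng.gauss(0.0, 1.0) for _ in range(16)]
mismatched = [-v for v in target]

reward_matching = noise_prediction_reward(toy_predict_noise, target, target, random.Random(1))
reward_mismatched = noise_prediction_reward(toy_predict_noise, target, mismatched, random.Random(1))
print(reward_matching > reward_mismatched)
```

In a policy-learning setup of the kind the abstract outlines, a scalar like this would score each prompt the agent proposes, so the agent is updated toward prompts whose conditioning lowers the noise prediction error on images of the erased concept.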

Related papers

Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation [81.40978077888693]
Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations.
arXiv Detail & Related papers (2026-03-05T04:45:49Z)
Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models [7.17300076441681]
SurgUn is a surgical unlearning method that applies targeted weight-space updates to remove specific visual concepts in text-conditioned diffusion models. Our approach is motivated by retroactive interference theory, which holds that newly acquired memories can overwrite, suppress, or impede access to prior ones. We adapt this principle to diffusion models by inducing retroactive concept interference, enabling focused destabilization of only the target concept.
arXiv Detail & Related papers (2026-03-01T08:07:14Z)
Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion [0.0]
Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones.
arXiv Detail & Related papers (2026-01-06T17:52:02Z)
Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z)
Revoking Amnesia: RL-based Trajectory Optimization to Resurrect Erased Concepts in Diffusion Models [38.38751366738881]
Concept erasure techniques have been widely deployed in T2I diffusion models to prevent inappropriate content generation for safety and copyright considerations. Established erasure methods exhibit degraded effectiveness, raising questions about their true mechanisms. We propose RevAm, a trajectory optimization framework that resurrects erased concepts by dynamically steering the denoising process.
arXiv Detail & Related papers (2025-09-30T07:46:19Z)
LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling [38.700993166492495]
We propose a dataset-free and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates a multimodal understanding model to provide semantic priors for the generative model under a task-blind condition.
arXiv Detail & Related papers (2025-07-01T14:25:09Z)
Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning [65.85335291827086]
This paper tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. We pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model.
arXiv Detail & Related papers (2025-03-11T13:50:22Z)
Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models [7.9993879763024065]
This paper presents a theoretical and empirical examination of five commonly used techniques for unlearning in diffusion models. We introduce two new evaluation metrics: Concept Retrieval Score (CRS) and Concept Confidence Score (CCS).
arXiv Detail & Related papers (2024-09-09T14:38:31Z)
Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been considered a challenging property to encode in neural networks. We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z)
Rethinking and Defending Protective Perturbation in Personalized Diffusion Models [21.30373461975769]
We study the fine-tuning process of personalized diffusion models (PDMs) through the lens of shortcut learning. PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets. We propose a systematic defense framework that includes data purification and contrastive decoupling learning.
arXiv Detail & Related papers (2024-06-27T07:14:14Z)
Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient [20.698305103879232]
We propose a novel concept domain correction framework named DoCo (Domain Correction). By aligning the output domains of sensitive and anchor concepts through adversarial training, our approach ensures comprehensive unlearning of target concepts. We also introduce a concept-preserving gradient surgery technique that mitigates conflicting gradient components, thereby preserving the model's utility while unlearning specific concepts.
arXiv Detail & Related papers (2024-05-24T07:47:36Z)
Model Will Tell: Training Membership Inference for Diffusion Models [15.16244745642374]
The Training Membership Inference (TMI) task aims to determine whether a specific sample has been used in the training process of a target model. In this paper, we explore a novel perspective for the TMI task by leveraging the intrinsic generative priors within the diffusion model.
arXiv Detail & Related papers (2024-03-13T12:52:37Z)
Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey [73.86861112002593]
We present a comprehensive review of recent diffusion model-based methods on image restoration. We classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR. We propose five potential and challenging directions for the future research of diffusion model-based IR.
arXiv Detail & Related papers (2023-08-18T08:40:38Z)
Exploiting Diffusion Prior for Real-World Image Super-Resolution [75.5898357277047]
We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution. By employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model.
arXiv Detail & Related papers (2023-05-11T17:55:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.