Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
- URL: http://arxiv.org/abs/2601.03213v1
- Date: Tue, 06 Jan 2026 17:52:02 GMT
- Title: Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
- Authors: Mykola Vysotskyi, Zahar Kohut, Mariia Shpir, Taras Rumezhak, Volodymyr Karpiv
- Abstract summary: Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method achieves better or comparable forgetting to strong baselines while maintaining image quality and benign prompt fidelity; ablations show that (i) per-step critics and (ii) noisy-conditioned rewards are key to stability and effectiveness. We release code and evaluation scripts to facilitate reproducibility and future research on RL-based diffusion unlearning.
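For concreteness, here is a minimal PyTorch-style sketch of the loop the abstract describes: per-step rewards from a CLIP-based predictor scored on noisy latents, a timestep-aware critic as baseline, and policy-gradient updates of the reverse kernel. All interfaces (`policy`, `critic`, `reward_model`) and the reward sign convention are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of critic-guided diffusion unlearning; all interfaces
# are assumptions inferred from the abstract, not the paper's actual API.
import torch

def unlearning_update(policy, critic, reward_model, x_T, prompt_emb,
                      timesteps, optimizer):
    """One policy-gradient update over a sampled denoising trajectory."""
    x_t = x_T
    loss = torch.zeros((), device=x_T.device)
    for t in timesteps:  # e.g. reversed(range(T))
        # Sample x_{t-1} from the current reverse kernel p_theta(x_{t-1} | x_t).
        mean, std = policy(x_t, t, prompt_emb)
        x_prev = (mean + std * torch.randn_like(mean)).detach()
        logp = torch.distributions.Normal(mean, std).log_prob(x_prev).sum()

        with torch.no_grad():
            # Noisy-step reward: the CLIP-based predictor is scored on the
            # noisy latent; negated so the target concept is pushed away.
            r_t = -reward_model(x_prev, t, prompt_emb)
            # The timestep-aware critic acts as a per-step baseline.
            advantage = r_t - critic(x_t, t, prompt_emb)

        loss = loss - advantage * logp  # REINFORCE with a learned baseline
        x_t = x_prev                    # continue the trajectory

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Detaching each sampled latent keeps this a pure score-function estimator, and the per-step baseline is what the abstract credits for taming the variance of sparse end-of-trajectory rewards.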
Related papers
- Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion [44.47036589940717]
Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. We propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube."
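The abstract gives only the high-level idea; as a toy illustration, the hypothetical sketch below steers one denoising step away from a region around a known memorized latent, with a plain distance threshold standing in for a real reachable-tube approximation. Every name and mechanism here is an assumption, not RADS itself.

```python
# Toy stand-in for reachability-based steering; entirely hypothetical.
import torch

def steered_step(denoiser, x_t, t, memorized_latent, radius, step_size):
    """Denoise one step, then push the latent outward if the trajectory
    looks set to land near a known memorized sample."""
    x_prev = denoiser(x_t, t)
    dist = torch.norm(x_prev - memorized_latent)
    if dist < radius:  # crude proxy for the backward reachable tube
        direction = (x_prev - memorized_latent) / (dist + 1e-8)
        x_prev = x_prev + step_size * direction  # steer away
    return x_prev
```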
arXiv Detail & Related papers (2026-02-24T09:07:08Z)
- Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [29.56905427210088]
gradient-ARM is a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. We show that gradient-ARM achieves state-of-the-art performance among baselines on benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
arXiv Detail & Related papers (2026-02-02T00:50:53Z) - ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models [12.021923446217722]
Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models. Existing adversarial approaches for exploiting residual concept leakage are constrained by fundamental limitations. We introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem.
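As a rough illustration of casting concept restoration as RL, the hedged sketch below rewards a prompt policy whenever the unlearned model regenerates the erased concept; `prompt_policy`, `unlearned_model`, and `concept_scorer` are hypothetical interfaces inferred from the abstract, not ReLAPSe's actual code.

```python
# Illustrative REINFORCE step for adversarial prompt search; interfaces assumed.
import torch

def adversarial_prompt_step(prompt_policy, unlearned_model, concept_scorer,
                            optimizer):
    tokens, logps = prompt_policy.sample()   # prompt tokens and their log-probs
    with torch.no_grad():
        image = unlearned_model(tokens)      # generation is not differentiated
        reward = concept_scorer(image)       # high if the erased concept returns
    loss = -reward * logps.sum()             # reinforce prompts that restore it
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```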
arXiv Detail & Related papers (2026-01-30T21:56:50Z)
- Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm. QAM sidesteps two key challenges by leveraging adjoint matching, a recently proposed technique in generative modeling. It consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z)
- Data-regularized Reinforcement Learning for Diffusion Models at Scale [99.01056178660538]
We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. With over a million GPU hours of experiments and ten thousand double-blind evaluations, we demonstrate that DDRL significantly improves rewards while alleviating the reward hacking seen in standard RL.
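One plausible reading of such an objective, written out as math (our assumed form, not the paper's exact equation):

```latex
% Assumed forward-KL-anchored objective: maximize reward while the forward
% KL term ties the policy pi_theta to the off-policy data distribution.
\[
  \min_\theta \;
  -\,\mathbb{E}_{x \sim \pi_\theta}\!\left[ r(x) \right]
  \;+\; \beta \,\mathrm{KL}\!\left( p_{\mathrm{data}} \,\|\, \pi_\theta \right),
  \qquad
  \mathrm{KL}\!\left( p_{\mathrm{data}} \,\|\, \pi_\theta \right)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[
      \log \frac{p_{\mathrm{data}}(x)}{\pi_\theta(x)}
    \right].
\]
```

Since the entropy of the data distribution is constant in the model parameters, minimizing the forward KL reduces to a maximum-likelihood anchor on off-policy data, which is what makes this form compatible with data reuse.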
arXiv Detail & Related papers (2025-12-03T23:45:07Z)
- Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
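The underlying identity is standard (a general fact, not this paper's specific derivation): because the sequence log-probability factorizes autoregressively, the sequence-level REINFORCE gradient is exactly a sum of token-level terms:

```latex
% The autoregressive factorization of log pi_theta(y|x) turns the
% sequence-level policy gradient into a sum over token positions.
\[
  \nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta}\!\left[ R(y) \right]
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[
      R(y)\, \nabla_\theta \log \pi_\theta(y \mid x)
    \right]
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[
      R(y) \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, x)
    \right].
\]
```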
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
- Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning [25.53799024782883]
Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model. Recent findings reveal that post-unlearning manipulations, such as weight quantization or fine-tuning, can quickly neutralize the intended forgetting.
arXiv Detail & Related papers (2025-10-01T10:50:14Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
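A minimal trial-and-error loop consistent with this description might look like the following; the interfaces (`selector`, `vqa_model`) are assumptions, not ViaRL's implementation.

```python
# Hypothetical frame-selector update rewarded by downstream answer accuracy.
import torch

def frame_selector_step(selector, frames, question, gold_answer, vqa_model,
                        optimizer):
    indices, logps = selector.sample(frames, question)  # choose a frame subset
    with torch.no_grad():
        prediction = vqa_model([frames[i] for i in indices], question)
        reward = 1.0 if prediction == gold_answer else 0.0  # rule-based reward
    loss = -reward * logps.sum()  # REINFORCE on the selection log-probs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```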
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
- Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards [52.90573877727541]
Reinforcement learning (RL) has been considered for diffusion model fine-tuning, but its effectiveness is limited by the challenge of sparse rewards. B²-DiffuRL is compatible with existing optimization algorithms.
arXiv Detail & Related papers (2025-03-14T09:45:19Z)
- Exploiting Diffusion Prior for Real-World Image Super-Resolution [75.5898357277047]
We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution.
By employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model.
arXiv Detail & Related papers (2023-05-11T17:55:25Z)