Related papers: GARDO: Reinforcing Diffusion Models without Reward Hacking

GARDO: Reinforcing Diffusion Models without Reward Hacking

URL: http://arxiv.org/abs/2512.24138v1
Date: Tue, 30 Dec 2025 10:55:45 GMT
Title: GARDO: Reinforcing Diffusion Models without Reward Hacking
Authors: Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan,
Abstract summary: Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment.<n>The mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses.<n>We propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO) to address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking.
Score: 54.841464430913476
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.

Related papers

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation.<n>To bridge this gap, this paper targets to develop a unified framework for omnibus vision-language forgery detection and grounding.<n>We propose textbf OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z)
MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs)<n>We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck.<n>We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
GDRO: Group-level Reward Post-training Suitable for Diffusion Models [55.948229011478304]
Group-level rewards successfully align the model with the targeted reward.<n>Group-level Direct Reward Optimization (GDRO) is a new post-training paradigm for group-level reward alignment.<n>GDRO supports full offline training that saves the large time cost for image rollout sampling.<n>It is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtainity.
arXiv Detail & Related papers (2026-01-05T11:47:18Z)
Data-regularized Reinforcement Learning for Diffusion Models at Scale [99.01056178660538]
We introduce Data-regularized Diffusion Reinforcement Learning ( DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution.<n>With over a million GPU hours of experiments and ten thousand double-blind evaluations, we demonstrate that DDRL significantly improves rewards while alleviating the reward hacking seen in RLs.
arXiv Detail & Related papers (2025-12-03T23:45:07Z)
Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation [60.04281435591454]
CRDA (Curriculum Reinforcement-Learning Data Augmentation) is a novel framework guiding detectors to progressively master multi-domain forgery features.<n>Central to our approach is integrating reinforcement learning and causal inference.<n>Our method significantly improves detector generalizability, outperforming SOTA methods across multiple cross-domain datasets.
arXiv Detail & Related papers (2025-11-10T12:45:52Z)
Reinforced Preference Optimization for Recommendation [28.87206911186567]
We propose Reinforced Preference Optimization for Recommendation (ReRe) for generative recommenders.<n>ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives.<n>We show that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance.
arXiv Detail & Related papers (2025-10-14T07:04:33Z)
Multi-Metric Preference Alignment for Generative Speech Restoration [15.696247605348383]
We propose a multi-metric preference alignment strategy for generative models.<n>We observe consistent and significant performance gains across three diverse generative paradigms.<n>Our aligned models can serve as powerful ''data annotators'', generating high-quality pseudo-labels.
arXiv Detail & Related papers (2025-08-24T07:05:10Z)
Fake it till You Make it: Reward Modeling as Discriminative Prediction [49.31309674007382]
GAN-RM is an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering.<n>Our method trains the reward model through discrimination between a small set of representative, unpaired target samples.<n>Experiments demonstrate our GAN-RM's effectiveness across multiple key applications.
arXiv Detail & Related papers (2025-06-16T17:59:40Z)
ROCM: RLHF on consistency models [8.905375742101707]
We propose a reward optimization framework for applying RLHF to consistency models.<n>We investigate various $f$-divergences as regularization strategies, striking a balance between reward and model consistency.
arXiv Detail & Related papers (2025-03-08T11:19:48Z)
Test-time Alignment of Diffusion Models without Reward Over-optimization [8.981605934618349]
Diffusion models excel in generative tasks, but aligning them with specific objectives remains challenging.<n>We propose a training-free, test-time method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution.<n>We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization.
arXiv Detail & Related papers (2025-01-10T09:10:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.