Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
- URL: http://arxiv.org/abs/2512.24146v1
- Date: Tue, 30 Dec 2025 11:17:52 GMT
- Title: Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
- Authors: Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li
- Abstract summary: We propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates Preference Mode Collapse (PMC). D$^2$-Align achieves superior alignment with human preference.
- Score: 27.33241821967005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC), a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.
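To make the correction mechanism concrete, here is a minimal PyTorch sketch of a directionally corrected reward in the spirit of the abstract, assuming the frozen reward model scores image embeddings with a scalar head. All names (`DirectionalRewardCorrector`, `bias_direction`, `correction_scale`) and the projection-subtraction rule are illustrative assumptions, not the paper's actual formulation; how the direction itself is learned is also left out.

```python
import torch
import torch.nn.functional as F

class DirectionalRewardCorrector(torch.nn.Module):
    def __init__(self, embed_dim: int, correction_scale: float = 1.0):
        super().__init__()
        # Learnable direction in the reward model's embedding space, meant to
        # capture the biased mode (e.g., overexposure) the RM over-rewards.
        self.bias_direction = torch.nn.Parameter(0.01 * torch.randn(embed_dim))
        self.correction_scale = correction_scale

    def forward(self, embeddings: torch.Tensor,
                raw_rewards: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, embed_dim) from the frozen reward model;
        # raw_rewards: (batch,) scalar scores from the same frozen model.
        direction = F.normalize(self.bias_direction, dim=0)
        # Subtract the reward component aligned with the bias direction, so
        # the policy gains nothing by drifting further along that mode.
        bias_component = embeddings @ direction
        return raw_rewards - self.correction_scale * bias_component
```

During RL fine-tuning, the corrected reward would replace the raw score in the policy-gradient update; the reward model's own weights stay frozen throughout, matching the abstract's description.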
Related papers
- DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment [49.45064510462232]
GRPO-based approaches for text-to-image generation suffer from the sparse reward problem. We introduce DenseGRPO, a novel framework that aligns human preference with dense rewards.
arXiv Detail & Related papers (2026-01-28T03:39:05Z)
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. We propose SoliReward, a systematic framework for video reward model training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z)
- Multi-dimensional Preference Alignment by Conditioning Reward Itself [32.33870784484853]
Multi Reward Conditional DPO (MCDPO) resolves reward conflicts by introducing a disentangled Bradley-Terry objective (a generic form of this objective is sketched after this entry). Experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior performance on benchmarks.
arXiv Detail & Related papers (2025-12-11T02:44:31Z)
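As a reference point, here is a minimal sketch of the standard Bradley-Terry preference loss that DPO-style methods build on; the per-reward-dimension disentangling and conditioning are MCDPO's contribution and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_preferred: torch.Tensor,
                       score_rejected: torch.Tensor) -> torch.Tensor:
    # The Bradley-Terry model sets P(preferred beats rejected) =
    # sigmoid(score_preferred - score_rejected); the loss is the
    # negative log-likelihood of the observed preferences.
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```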
- G$^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions (a generic multi-scale advantage computation is sketched after this entry). We introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales. Our G$^2$RPO significantly outperforms existing flow-based GRPO baselines.
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
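A hedged sketch of group-relative advantages aggregated over multiple scales, in the spirit of the Multi-Granularity Advantage Integration described above; the paper's exact integration rule may differ, and uniform averaging across scales is an assumption here.

```python
import torch

def multi_scale_grpo_advantage(rewards_per_scale: list[torch.Tensor]) -> torch.Tensor:
    # rewards_per_scale: one (group_size,) reward tensor per diffusion scale.
    advantages = []
    for rewards in rewards_per_scale:
        # Standard GRPO normalization: center and scale within the group.
        advantages.append((rewards - rewards.mean()) / (rewards.std() + 1e-8))
    # Aggregate across scales (assumption: uniform averaging).
    return torch.stack(advantages).mean(dim=0)
```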
- Generative Model Inversion Through the Lens of the Manifold Hypothesis [98.37040155914595]
Model inversion attacks (MIAs) aim to reconstruct class-representative samples from trained models. Recent generative MIAs utilize generative adversarial networks to learn image priors that guide the inversion process.
arXiv Detail & Related papers (2025-09-24T14:39:25Z)
- S$^2$-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models [26.255679321570014]
S$^2$-Guidance is a novel method that leverages block-dropping during the forward process to construct sub-networks (a sketch of the guidance combination is given after this entry). Experiments on text-to-image and text-to-video generation tasks demonstrate that S$^2$-Guidance delivers superior performance.
arXiv Detail & Related papers (2025-08-18T12:31:20Z)
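A minimal sketch of how a block-dropped sub-network's prediction could serve as a self-derived negative branch, by analogy with classifier-free guidance; the block-dropping that produces `pred_sub` happens in the caller, and this combination rule is an assumption, not the paper's stated formula.

```python
import torch

def s2_style_guidance(pred_full: torch.Tensor,
                      pred_sub: torch.Tensor,
                      guidance_scale: float = 2.0) -> torch.Tensor:
    # pred_full: noise prediction from the full network;
    # pred_sub: prediction from a sub-network built by randomly dropping blocks.
    # Steer the sample away from the weaker sub-network's prediction.
    return pred_full + guidance_scale * (pred_full - pred_sub)
```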
- Divergence Minimization Preference Optimization for Diffusion Model Alignment [66.31417479052774]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence (the standard form of this divergence is recalled after this entry). DMPO can consistently outperform or match existing techniques across different base models and test sets.
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
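For reference, the reverse KL divergence named above has the standard form (the paper's exact preference-weighted objective is not reproduced here):

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\mathrm{ref}}) = \mathbb{E}_{x \sim p_\theta}\left[\log p_\theta(x) - \log p_{\mathrm{ref}}(x)\right]$$

Because the expectation is taken under the model $p_\theta$, this direction is mode-seeking: the model is penalized for placing probability mass where the reference assigns little.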
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
- Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling [27.11560841914813]
We introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails.
arXiv Detail & Related papers (2025-07-08T21:56:33Z)
- ROCM: RLHF on consistency models [8.905375742101707]
We propose a reward optimization framework for applying RLHF to consistency models (a generic divergence-regularized reward objective is sketched after this entry). We investigate various $f$-divergences as regularization strategies, striking a balance between reward maximization and model consistency.
arXiv Detail & Related papers (2025-03-08T11:19:48Z)
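A minimal sketch of a divergence-regularized reward objective of the kind described, using a Monte Carlo KL estimate as the simplest member of the $f$-divergence family; the coefficient `beta` and the use of per-sample log-probabilities are assumptions, not the paper's exact formulation.

```python
import torch

def regularized_reward_objective(rewards: torch.Tensor,
                                 logp_model: torch.Tensor,
                                 logp_ref: torch.Tensor,
                                 beta: float = 0.1) -> torch.Tensor:
    # rewards: reward-model scores for sampled outputs, shape (batch,).
    # logp_model / logp_ref: log-probabilities of those samples under the
    # fine-tuned model and the frozen reference, respectively.
    # KL(model || ref) estimated per sample as logp_model - logp_ref.
    kl_penalty = logp_model - logp_ref
    # Maximize reward while staying close to the reference model;
    # return the negative so it can be minimized with an optimizer.
    return -(rewards - beta * kl_penalty).mean()
```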
- Model Inversion Attacks Through Target-Specific Conditional Diffusion Models [54.69008212790426]
Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications.
Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GAN's inherent flaws and biased optimization within latent space.
We propose Diffusion-based Model Inversion (Diff-MI) attacks to alleviate these issues.
arXiv Detail & Related papers (2024-07-16T06:38:49Z)