MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models
- URL: http://arxiv.org/abs/2510.01549v1
- Date: Thu, 02 Oct 2025 00:47:36 GMT
- Title: MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models
- Authors: Kevin Zhai, Utsav Singh, Anirudh Thatipelli, Souradip Chakraborty, Anit Kumar Sahu, Furong Huang, Amrit Singh Bedi, Mubarak Shah
- Abstract summary: Diffusion models excel at generating images conditioned on text prompts. The resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative. We show that this approach suffers from reward hacking, where the model produces images that score highly, yet deviate significantly from the original prompt.
- Score: 86.07486858219137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models excel at generating images conditioned on text prompts, but the resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. This alignment typically requires fine-tuning, which is computationally demanding. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative, modifying initial input noise to steer the diffusion denoising process towards generating high-reward images. However, this approach suffers from reward hacking, where the model produces images that score highly, yet deviate significantly from the original prompt. We show that noise-space regularization is insufficient and that preventing reward hacking requires an explicit image-space constraint. To this end, we propose MIRA (MItigating Reward hAcking), a training-free, inference-time alignment method. MIRA introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory with a frozen backbone, constraining the output distribution so reward can increase without off-distribution drift (reward hacking). We derive a tractable approximation to KL using diffusion scores. Across SDv1.5 and SDXL, multiple rewards (Aesthetic, HPSv2, PickScore), and public datasets (e.g., Animal-Animal, HPDv2), MIRA achieves a >60% win rate vs. strong baselines while preserving prompt adherence; mechanism plots show reward gains with near-zero drift, whereas DNO drifts as compute increases. We further introduce MIRA-DPO, mapping preference optimization to inference time with a frozen backbone, extending MIRA to non-differentiable rewards without fine-tuning.
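The core idea in the abstract (gradient-based optimization of the initial noise to raise a reward, with an image-space penalty against a frozen-backbone reference to prevent drift) can be illustrated with a toy 1-D sketch. Everything here is an illustrative assumption: `denoise`, `reward`, and the penalty weight `lam` are stand-ins, not the paper's actual sampler, reward model, or KL surrogate.

```python
def denoise(noise):
    # Stand-in for a frozen diffusion sampler: maps initial noise to an "image".
    return 0.5 * noise + 1.0

def reward(image):
    # Stand-in scalar reward (e.g. an aesthetic score), maximized at image = 3.0.
    return -(image - 3.0) ** 2

def aligned_sample(noise0, lam=0.5, lr=0.1, steps=200, eps=1e-4):
    """Gradient ascent on the initial noise, with an image-space penalty
    keeping the output near the frozen model's original sample."""
    ref_image = denoise(noise0)  # frozen-backbone reference output
    z = noise0
    for _ in range(steps):
        def objective(n):
            img = denoise(n)
            # Reward minus an image-space drift penalty (toy KL-like surrogate).
            return reward(img) - lam * (img - ref_image) ** 2
        # Central finite difference stands in for backprop through the sampler.
        grad = (objective(z + eps) - objective(z - eps)) / (2 * eps)
        z += lr * grad
    return denoise(z)
```

With `lam=0` the optimized noise drives the image all the way to the reward maximum (the "hacking" regime); a larger `lam` keeps the output close to the frozen model's sample while still moving toward higher reward, which is the trade-off the image-space constraint is meant to control.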
Related papers
- FAIL: Flow Matching Adversarial Imitation Learning for Image Generation [52.643484089126844]
Post-training of flow matching models (aligning the output distribution with a high-quality target) is mathematically equivalent to imitation learning. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons.
arXiv Detail & Related papers (2026-02-12T16:36:33Z) - TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning [53.52543819839442]
A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them.
arXiv Detail & Related papers (2025-11-27T09:14:26Z) - Fine-Tuning Diffusion Models via Intermediate Distribution Shaping [33.26998978897412]
Policy gradient methods are widely used in the context of autoregressive generation. We show that GRAFT implicitly performs PPO with reshaped rewards. We then introduce P-GRAFT to shape distributions at intermediate noise levels. Motivated by this, we propose inverse noise correction to improve flow models without leveraging explicit rewards.
arXiv Detail & Related papers (2025-10-03T03:18:47Z) - Learn to Guide Your Diffusion Model [84.82855046749657]
We study a technique for improving the quality of samples from conditional diffusion models. We learn guidance weights $\omega_{c,(s,t)}$, which are functions of the conditioning $c$, the time $t$ from which we denoise, and the time $s$ towards which we denoise. We extend our framework to reward-guided sampling, enabling the model to target distributions tilted by a reward function.
arXiv Detail & Related papers (2025-10-01T12:21:48Z) - Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards [52.90573877727541]
Reinforcement learning (RL) has been considered for diffusion model fine-tuning. RL's effectiveness is limited by the challenge of sparse rewards. $\text{B}^2$-DiffuRL is compatible with existing optimization algorithms.
arXiv Detail & Related papers (2025-03-14T09:45:19Z) - Continuous Speculative Decoding for Autoregressive Image Generation [27.308442169466975]
Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation. Speculative decoding has effectively accelerated discrete autoregressive inference. This work addresses challenges from low acceptance rates, inconsistent output distributions, and a modified distribution without an analytic expression.
arXiv Detail & Related papers (2024-11-18T09:19:15Z) - David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training [8.352666876052616]
We propose Diff-Instruct* (DI*), a data-efficient post-training approach for one-step text-to-image generative models. Our method frames alignment as online reinforcement learning from human feedback. Our 2.6B DI*-SDXL-1step model outperforms the 50-step 12B FLUX-dev model.
arXiv Detail & Related papers (2024-10-28T10:26:19Z) - Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
We propose an algorithm that enables fast and high-quality generation under arbitrary constraints. During inference, we can interchange between gradient updates computed on the noisy image and updates computed on the final, clean image. Our approach produces results that rival or surpass state-of-the-art training-free inference approaches.
arXiv Detail & Related papers (2024-10-24T14:52:38Z) - Direct Unsupervised Denoising [60.71146161035649]
Unsupervised denoisers do not directly produce a single prediction, such as the MMSE estimate.
We present an alternative approach that trains a deterministic network alongside the VAE to directly predict a central tendency.
arXiv Detail & Related papers (2023-10-27T13:02:12Z) - On the Posterior Distribution in Denoising: Application to Uncertainty Quantification [28.233696029453775]
Tweedie's formula links the posterior mean in Gaussian denoising with the score of the data distribution.
We show how to efficiently compute the principal components of the posterior distribution for any desired region of an image.
Our method is fast and memory-efficient, as it does not explicitly compute or store the high-order moment tensors.
arXiv Detail & Related papers (2023-09-24T10:07:40Z) - RAIN: A Simple Approach for Robust and Accurate Image Classification Networks [156.09526491791772]
It has been shown that the majority of existing adversarial defense methods achieve robustness at the cost of sacrificing prediction accuracy.
This paper proposes a novel preprocessing framework, which we term Robust and Accurate Image classificatioN (RAIN).
RAIN applies randomization over inputs to break the ties between the model forward prediction path and the backward gradient path, thus improving the model robustness.
We conduct extensive experiments on the STL10 and ImageNet datasets to verify the effectiveness of RAIN against various types of adversarial attacks.
arXiv Detail & Related papers (2020-04-24T02:03:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences of its use.