Understanding Reward Hacking in Text-to-Image Reinforcement Learning
- URL: http://arxiv.org/abs/2601.03468v1
- Date: Tue, 06 Jan 2026 23:43:47 GMT
- Title: Understanding Reward Hacking in Text-to-Image Reinforcement Learning
- Authors: Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, Cho-Jui Hsieh
- Abstract summary: We analyze reward hacking behaviors in text-to-image (T2I) RL post-training. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. We propose a lightweight and adaptive artifact reward model, trained on a small dataset of artifact-free and artifact-containing samples.
- Score: 43.358394359914314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, using reward functions to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking: producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how aesthetic/human preference rewards and prompt-image consistency rewards each contribute to reward hacking, and further show that ensembling multiple rewards only partially mitigates the issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. To address this, we propose a lightweight and adaptive artifact reward model, trained on a small curated dataset of artifact-free and artifact-containing samples. This model can be integrated into existing RL pipelines as an effective regularizer for commonly used reward models. Experiments demonstrate that incorporating our artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, showing that lightweight reward augmentation can serve as a safeguard against reward hacking.
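As a rough illustration of the regularizer idea in the abstract, the sketch below combines a base preference or consistency reward with an artifact reward. The additive form, the weight `lam`, and all names are assumptions made for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch (not the paper's code): regularizing a base T2I reward
# with an artifact reward, as the abstract describes.
import numpy as np

def combined_reward(base_scores, artifact_scores, lam=0.5):
    """Combine a preference/consistency reward with an artifact reward.

    base_scores     -- scores from an aesthetic or prompt-consistency model
    artifact_scores -- assumed probability that each image is artifact-free
    lam             -- assumed weight of the artifact term
    """
    base_scores = np.asarray(base_scores, dtype=float)
    artifact_scores = np.asarray(artifact_scores, dtype=float)
    return base_scores + lam * artifact_scores

# Two images with similar preference scores; the second is artifact-prone,
# so its combined reward drops and the RL update favors the first.
print(combined_reward([0.82, 0.85], [0.95, 0.20]))
```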
Related papers
- Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking [69.06218054848803]
We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations. ARA achieves the best alignment-utility tradeoff among all baselines.
arXiv Detail & Related papers (2026-02-02T07:34:57Z) - The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation [52.648073272395635]
We introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively.
arXiv Detail & Related papers (2025-11-25T12:35:57Z) - Reward Hacking Mitigation using Verifiable Composite Rewards [5.061948558533868]
Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. This work addresses two primary forms of this behavior: (i) providing a final answer without preceding reasoning, and (ii) employing non-standard reasoning formats to exploit the reward mechanism.
arXiv Detail & Related papers (2025-09-19T03:40:27Z) - Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models [28.542061921495353]
There are two mainstream reward paradigms: model-based rewards and rule-based rewards. Both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. We propose Cooper, an RL framework that jointly optimizes both the policy model and the reward model. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct.
arXiv Detail & Related papers (2025-08-07T17:53:56Z) - Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n search on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z) - Reward Shaping to Mitigate Reward Hacking in RLHF [47.71454266800376]
Preference As Reward (PAR) is a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches.
arXiv Detail & Related papers (2025-02-26T02:57:59Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z)
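To make the ensembling idea in the last entry concrete, here is a minimal sketch of aggregating several reward models into one estimate. The mean and conservative min aggregation rules and the toy numbers are assumptions for illustration, not taken from any of the papers above.

```python
# Minimal sketch of reward-model ensembling (assumed aggregation rules, toy data).
import numpy as np

def ensemble_reward(scores, how="mean"):
    """Aggregate per-model reward scores of shape (num_models, num_samples).

    "mean" averages ensemble members; "min" takes the most pessimistic member,
    a conservative choice against over-optimistic (hacked) scores.
    """
    scores = np.asarray(scores, dtype=float)
    if how == "mean":
        return scores.mean(axis=0)
    if how == "min":
        return scores.min(axis=0)
    raise ValueError(f"unknown aggregation: {how}")

# Three reward models scoring two samples: if every member shares the same
# blind spot on the second sample, neither aggregation flags the hack,
# consistent with the finding that ensembles mitigate but do not eliminate it.
scores = [[0.7, 0.9], [0.6, 0.9], [0.8, 0.9]]
print(ensemble_reward(scores, "mean"), ensemble_reward(scores, "min"))
```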