The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
- URL: http://arxiv.org/abs/2511.20256v1
- Date: Tue, 25 Nov 2025 12:35:57 GMT
- Title: The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
- Authors: Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou
- Abstract summary: We introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively.
- Score: 52.648073272395635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
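The abstract describes two key ingredients: a dense visual reward computed from vision-foundation-model features of generated versus reference images, and an adversarial reward model that is updated with reference images as positives and current generations as negatives inside a GRPO-style loop. The following is a minimal illustrative sketch of how those pieces could fit together, not the authors' released code; the function names, the feature shapes, and the choice of cosine similarity over DINO features are assumptions.

```python
# Illustrative sketch (not the released Adv-GRPO code): a dense visual reward
# from vision-foundation-model features plus an adversarial reward-model update.
# Function names, shapes, and the cosine-similarity choice are assumptions.
import torch
import torch.nn.functional as F

def visual_reward(gen_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """Dense visual reward: similarity between foundation-model features
    (e.g., DINO embeddings) of generated and reference images, shapes [B, D]."""
    return F.cosine_similarity(gen_feats, ref_feats, dim=-1)  # [B]

def adversarial_reward_loss(reward_model: torch.nn.Module,
                            ref_feats: torch.Tensor,
                            gen_feats: torch.Tensor) -> torch.Tensor:
    """Adversarial update: train the reward model to score reference images
    (positives) above the generator's current samples (negatives)."""
    pos_logits = reward_model(ref_feats)
    neg_logits = reward_model(gen_feats)
    return (F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits)) +
            F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits)))

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages (GRPO-style): normalize rewards within the
    group of samples generated for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

In a training loop, the reward-model update and the generator's GRPO update would alternate, which is what lets the learned reward keep tracking the generator's outputs rather than being hacked by them.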
Related papers
- Understanding Reward Hacking in Text-to-Image Reinforcement Learning [43.358394359914314]
We analyze reward hacking behaviors in text-to-image (T2I) RL post-training. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. We propose a lightweight and adaptive artifact reward model, trained on a small dataset of artifact-free and artifact-containing samples.
arXiv Detail & Related papers (2026-01-06T23:43:47Z)
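The artifact reward model described in the entry above is essentially a binary classifier used as a reward. A minimal illustrative sketch follows (not the cited paper's code; `ArtifactRewardModel`, the feature dimension, and the training step are assumptions), where the reward is the predicted probability that an image is artifact-free.

```python
# Hypothetical sketch: a lightweight artifact classifier used as a reward.
# Trained on artifact-free (label 1) vs. artifact-containing (label 0) images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArtifactRewardModel(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats).squeeze(-1)        # logits, shape [B]

    def reward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self(feats))          # P(artifact-free), shape [B]

def train_step(model, optimizer, feats, labels):
    """One supervised step on a small labeled batch (labels: 1 = artifact-free)."""
    loss = F.binary_cross_entropy_with_logits(model(feats), labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```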
- MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency [21.27005111847166]
Current text-to-image generative models are trained on large uncurated datasets. We propose to condition the model on multiple reward models during training to let the model learn user preferences directly.
arXiv Detail & Related papers (2025-10-29T18:59:17Z)
- GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning [90.99527142037853]
We develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning.
arXiv Detail & Related papers (2025-09-02T16:41:07Z)
- Residual Reward Models for Preference-based Reinforcement Learning [11.797520525358564]
Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify. PbRL can suffer from slow convergence since it requires training a reward model. We propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM).
arXiv Detail & Related papers (2025-07-01T09:43:57Z)
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
- T-REG: Preference Optimization with Token-Level Reward Regularization [35.07328450591201]
Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models with human values. Recent methods have attempted to address the coarseness of sequence-level rewards by introducing token-level rewards. We propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization.
arXiv Detail & Related papers (2024-12-03T18:56:07Z)
- RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z)
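The last entry above describes aggregating an ensemble of reward models into a single, more robust estimate. A minimal sketch of that aggregation step is shown below (illustrative only, not the cited paper's code; `reward_models` is assumed to be a list of callables returning per-image or per-response scores, and the mean/min choices are common conventions rather than the paper's specific method).

```python
# Illustrative sketch of reward-model ensembling: combine several reward
# models' scores into one estimate. Mean and worst-case (min) aggregation
# are common, conservative choices.
import torch

def ensemble_reward(reward_models, inputs, mode: str = "mean") -> torch.Tensor:
    """reward_models: list of callables mapping inputs -> score tensor of shape [B]."""
    scores = torch.stack([rm(inputs) for rm in reward_models], dim=0)  # [M, B]
    if mode == "mean":
        return scores.mean(dim=0)
    if mode == "min":  # pessimistic: only reward what every model agrees on
        return scores.min(dim=0).values
    raise ValueError(f"unknown aggregation mode: {mode}")
```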
This list is automatically generated from the titles and abstracts of the papers on this site.