MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
- URL: http://arxiv.org/abs/2510.25897v1
- Date: Wed, 29 Oct 2025 18:59:17 GMT
- Title: MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
- Authors: Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard
- Abstract summary: Current text-to-image generative models are trained on large uncurated datasets. We propose to condition the model on multiple reward models during training to let the model learn user preferences directly.
- Score: 21.27005111847166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current text-to-image generative models are trained on large uncurated datasets to enable diverse generation capabilities. However, this does not align well with user preferences. Recently, reward models have been designed specifically to perform post-hoc selection of generated images and align them to a reward, typically user preference. Discarding informative data in this way, together with optimizing for a single reward, tends to harm diversity, semantic fidelity, and efficiency. Instead of this post-processing, we propose to condition the model on multiple reward models during training so that the model learns user preferences directly. We show that this not only dramatically improves the visual quality of the generated images but also significantly speeds up training. Our proposed method, called MIRO, achieves state-of-the-art performance on the GenEval compositional benchmark and on user-preference scores (PickScore, ImageReward, HPSv2).
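Since the abstract describes conditioning the model on multiple reward scores during training, a minimal sketch may help fix the idea. This is not the authors' code: the `RewardEmbedder` module, the `extra_cond` hook, and the `noise_sched` interface are illustrative assumptions about how per-image reward vectors could be injected as an additional conditioning signal.

```python
# A minimal sketch of multi-reward conditioned pretraining, NOT the authors' code.
# Assumptions: `model` is a denoiser that accepts an extra conditioning vector via
# `extra_cond`; `reward_models` are frozen scorers mapping (images, prompts) -> (B,);
# `noise_sched` exposes `num_steps` and `add_noise`. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardEmbedder(nn.Module):
    """Embeds a vector of K reward scores into a conditioning vector."""
    def __init__(self, num_rewards: int, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_rewards, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, rewards: torch.Tensor) -> torch.Tensor:
        return self.mlp(rewards)  # (B, K) -> (B, dim)

def training_step(model, reward_models, embedder, images, prompts, text_emb, noise_sched):
    # Score every training image with every frozen reward model: no data is
    # discarded; scores become conditioning instead of a filtering criterion.
    with torch.no_grad():
        rewards = torch.stack([rm(images, prompts) for rm in reward_models], dim=-1)
    cond = embedder(rewards)  # (B, dim)
    t = torch.randint(0, noise_sched.num_steps, (images.size(0),), device=images.device)
    noisy, noise = noise_sched.add_noise(images, t)
    pred = model(noisy, t, text_emb, extra_cond=cond)
    return F.mse_loss(pred, noise)

# At inference, one would condition on high target reward values to steer samples
# toward preferred regions, analogously to guidance on a class label.
```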
Related papers
- The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation [52.648073272395635]
We introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. Unlike KL regularization, which constrains parameter updates, our learned reward directly guides the generator through its visual outputs. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively.
arXiv Detail & Related papers (2025-11-25T12:35:57Z)
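As a rough illustration of the alternating scheme in the entry above, the sketch below pairs a discriminator-style update of a learned reward with a group-relative policy update of the generator. `generator.sample`, `generator.log_prob`, and `reward_net` are assumed interfaces, not the paper's API, and Adv-GRPO's exact objectives may differ.

```python
# Hedged sketch of an adversarial-reward RL round; all interfaces are illustrative.
import torch
import torch.nn.functional as F

def adv_grpo_round(generator, reward_net, g_opt, r_opt, ref_images, prompts, group_size=8):
    # 1) Sample a group of candidate images per prompt from the current generator.
    with torch.no_grad():
        fakes = torch.stack(
            [generator.sample(prompts) for _ in range(group_size)], dim=1
        )  # (B, G, C, H, W)

    # 2) Adversarial reward update: real images should outscore generated ones
    #    (non-saturating discriminator-style objective).
    r_opt.zero_grad()
    real_s = reward_net(ref_images)           # (B,)
    fake_s = reward_net(fakes.flatten(0, 1))  # (B*G,)
    r_loss = F.softplus(-real_s).mean() + F.softplus(fake_s).mean()
    r_loss.backward()
    r_opt.step()

    # 3) Group-relative advantages: normalize scores within each prompt's group,
    #    then take a policy-gradient step on the generator.
    with torch.no_grad():
        scores = reward_net(fakes.flatten(0, 1)).view(-1, group_size)  # (B, G)
        adv = (scores - scores.mean(1, keepdim=True)) / (scores.std(1, keepdim=True) + 1e-6)
    g_opt.zero_grad()
    logp = generator.log_prob(fakes, prompts)  # assumed per-sample log-likelihood, (B, G)
    (-(adv * logp).mean()).backward()
    g_opt.step()
```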
- EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing [43.239693852521185]
EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks. It achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, ImagenHub, and our newly introduced benchmark. EditReward, together with its training dataset, will be released to help the community build more high-quality image editing training datasets.
arXiv Detail & Related papers (2025-09-30T14:51:04Z)
- Activation Reward Models for Few-Shot Model Alignment [77.37511364793515]
We introduce Activation Reward Models (Activation RMs), which leverage activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. We demonstrate the effectiveness of Activation RMs in mitigating reward-hacking behaviors, highlighting their utility for safety-critical applications.
arXiv Detail & Related papers (2025-07-02T05:10:29Z)
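The entry above gives few mechanistic details, so the following is a speculative sketch of one way an activation-based reward could work: derive a preference direction in a frozen model's hidden activations from a handful of labeled examples, then score new inputs by projection onto it. `model.hidden` is a hypothetical accessor for layer activations, not a real API.

```python
# Speculative sketch of an activation-steering-style reward; interfaces are hypothetical.
import torch

def preference_direction(model, layer, good_inputs, bad_inputs):
    # Mean difference of hidden activations between preferred and rejected examples.
    with torch.no_grad():
        good = torch.stack([model.hidden(x, layer) for x in good_inputs]).mean(0)
        bad = torch.stack([model.hidden(x, layer) for x in bad_inputs]).mean(0)
    d = good - bad
    return d / d.norm()

def activation_reward(model, layer, direction, x):
    # Reward = projection of the new example's activation onto the direction;
    # the base model stays frozen, so no finetuning is needed.
    with torch.no_grad():
        return model.hidden(x, layer) @ direction
```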
- Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation [61.31036260686349]
We propose a novel prompt optimization framework designed to rephrase a simple user prompt into a sophisticated prompt for a text-to-image model. Specifically, we employ large vision-language models (LVLMs) as the solver to rewrite the user prompt and, concurrently, as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback.
arXiv Detail & Related papers (2025-05-22T15:05:07Z)
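A minimal sketch of the solver/reward loop described in the entry above, with `lvlm.rewrite`, `lvlm.score`, and `t2i.generate` as placeholder interfaces rather than any real API:

```python
# Illustrative prompt-optimization loop: the LVLM proposes rewrites (solver role)
# and scores the resulting images (reward role). Interfaces are placeholders.
def optimize_prompt(user_prompt, lvlm, t2i, rounds=4, candidates=4):
    best_prompt, best_score = user_prompt, float("-inf")
    for _ in range(rounds):
        # Solver role: propose richer rewrites of the current best prompt.
        proposals = [lvlm.rewrite(best_prompt) for _ in range(candidates)]
        for p in proposals:
            image = t2i.generate(p)
            # Reward role: judge aesthetics and alignment. Scoring against the
            # ORIGINAL user prompt keeps the reward anchored to the user's intent.
            score = lvlm.score(image, user_prompt)
            if score > best_score:
                best_prompt, best_score = p, score
    return best_prompt
```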
- Capturing Individual Human Preferences with Reward Features [47.43999785878563]
We show that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model as well as adaptive counterparts.
arXiv Detail & Related papers (2025-03-21T17:39:33Z)
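The linear-combination claim above admits a compact illustration: with K general reward features precomputed per item, a user's weight vector can be fit from a few pairwise choices with a Bradley-Terry logistic objective. The array layout and fitting routine below are illustrative assumptions, not the paper's method.

```python
# Fit per-user weights over fixed reward features from pairwise preferences.
import numpy as np

def fit_user_weights(features: np.ndarray, chosen_first: np.ndarray,
                     lr: float = 0.1, steps: int = 500) -> np.ndarray:
    """features[i, 0] and features[i, 1] hold the K reward features of the two
    options in pair i; chosen_first[i] is 1.0 if the user preferred option 0."""
    w = np.zeros(features.shape[-1])
    for _ in range(steps):
        diff = features[:, 0] - features[:, 1]       # (pairs, K)
        p = 1.0 / (1.0 + np.exp(-diff @ w))          # P(option 0 preferred)
        grad = diff.T @ (p - chosen_first) / len(p)  # logistic-loss gradient
        w -= lr * grad
    return w

# Scoring a new item for this user is then just: reward = item_features @ w
```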
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way. Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models [20.70550870149442]
We introduce Annealed Importance Guidance (AIG), an inference-time regularization inspired by Annealed Importance Sampling.
Our experiments demonstrate the benefits of AIG for Stable Diffusion models, striking the optimal balance between reward optimization and image diversity.
arXiv Detail & Related papers (2024-09-09T16:27:26Z)
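AIG's exact formulation is not given in this summary; the sketch below shows one plausible reading of an annealed, inference-time interpolation between a base model's and a reward-tuned model's denoising predictions, with the mixing weight varying over the sampling trajectory. The schedule and combination rule are assumptions.

```python
# One plausible annealed-guidance schedule (illustrative; the paper may differ).
def annealed_eps(base_model, tuned_model, x_t, t: int, num_steps: int):
    # Mixing weight runs with the timestep: early in sampling (large t) the
    # prediction leans on the base model, shifting toward the reward-tuned
    # model as t -> 0, trading reward optimization against diversity.
    lam = t / (num_steps - 1)
    eps_base = base_model(x_t, t)
    eps_tuned = tuned_model(x_t, t)
    return lam * eps_base + (1.0 - lam) * eps_tuned
```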
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
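As a concrete reference point for the two ideas in the entry above, the sketch below combines the standard Bradley-Terry pairwise reward-modeling loss with a simple contrastive margin term that widens the chosen/rejected score gap; the paper's actual losses may differ.

```python
# Pairwise reward-modeling loss with an added contrastive margin (illustrative).
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor,
                      margin: float = 1.0, contrastive_weight: float = 0.1):
    # Bradley-Terry term: maximize P(chosen > rejected) = sigmoid(r_c - r_r).
    bt = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Contrastive hinge: push the score gap past a fixed margin so the model
    # separates chosen from rejected responses more sharply.
    contrastive = F.relu(margin - (r_chosen - r_rejected)).mean()
    return bt + contrastive_weight * contrastive
```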
- Human Preference Score: Better Aligning Text-to-Image Models with Human Preference [41.270068272447055]
We collect a dataset of human choices on generated images from the Stable Foundation Discord channel.
Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices.
We propose a simple yet effective method to adapt Stable Diffusion to better align with human preferences.
arXiv Detail & Related papers (2023-03-25T10:09:03Z)
- Ensembling Off-the-shelf Models for GAN Training [55.34705213104182]
We find that pretrained computer vision models can significantly improve performance when used in an ensemble of discriminators.
We propose an effective selection mechanism, by probing the linear separability between real and fake samples in pretrained model embeddings.
Our method can improve GAN training in both limited data and large-scale settings.
arXiv Detail & Related papers (2021-12-16T18:59:50Z)
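The selection mechanism in the entry above translates almost directly into code: fit a linear probe on each backbone's embeddings of real versus generated samples and rank backbones by held-out probe accuracy. In the sketch below, `embed` is an assumed per-model feature extractor, not the paper's implementation.

```python
# Rank pretrained backbones by linear separability of real vs. generated embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def separability_score(embed, real_images, fake_images) -> float:
    X = np.concatenate([embed(real_images), embed(fake_images)])
    y = np.concatenate([np.ones(len(real_images)), np.zeros(len(fake_images))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Higher held-out accuracy = embeddings separate real from fake more easily,
    # suggesting a more useful feature space for an ensemble discriminator.
    return probe.score(X_te, y_te)

# Usage: score each candidate backbone, then ensemble the top-ranked ones
# as discriminator feature extractors during GAN training.
```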