Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation
- URL: http://arxiv.org/abs/2503.13070v1
- Date: Mon, 17 Mar 2025 11:21:43 GMT
- Title: Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation
- Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Kenji Kawaguchi, Jing Tang
- Abstract summary: We introduce R0, a novel conditional generation approach via regularized reward maximization. We train state-of-the-art few-step text-to-image generative models with R0 at scale. Our results challenge the conventional wisdom of diffusion post-training and conditional generation.
- Score: 25.29877217341663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning generated images to complicated text prompts and human preferences is a central challenge in Artificial Intelligence-Generated Content (AIGC). With reward-enhanced diffusion distillation emerging as a promising approach that boosts controllability and fidelity of text-to-image models, we identify a fundamental paradigm shift: as conditions become more specific and reward signals stronger, the rewards themselves become the dominant force in generation. In contrast, the diffusion losses serve as an overly expensive form of regularization. To thoroughly validate our hypothesis, we introduce R0, a novel conditional generation approach via regularized reward maximization. Instead of relying on tricky diffusion distillation losses, R0 proposes a new perspective that treats image generation as an optimization problem in data space, aiming to search for valid images that have high compositional rewards. By innovative designs of the generator parameterization and proper regularization techniques, we train state-of-the-art few-step text-to-image generative models with R0 at scale. Our results challenge the conventional wisdom of diffusion post-training and conditional generation by demonstrating that rewards play a dominant role in scenarios with complex conditions. We hope our findings can contribute to further research into human-centric and reward-centric generation paradigms across the broader field of AIGC. Code is available at https://github.com/Luo-Yihong/R0.
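The abstract's core idea, generation as regularized reward maximization in data space, can be illustrated with a toy sketch. Everything below is hypothetical and not from the paper: `reward` stands in for a learned reward model, `regularizer` for a validity prior keeping samples near the data manifold, and gradients are taken by finite differences rather than backpropagation through a generator.

```python
# Toy sketch of regularized reward maximization in data space.
# All function names and the quadratic objectives are illustrative assumptions,
# not the actual R0 method.

def reward(x):
    # Stand-in for a learned reward model; peaks at x = 2.0 in every coordinate.
    return -sum((xi - 2.0) ** 2 for xi in x)

def regularizer(x):
    # Stand-in for a validity prior; here it simply pulls x toward 0.
    return sum(xi ** 2 for xi in x)

def generate(x0, lam=0.1, lr=0.05, steps=200, eps=1e-4):
    """Gradient-ascend reward(x) - lam * regularizer(x) via central differences."""
    x = list(x0)
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            xp, xm = list(x), list(x)
            xp[i] += eps
            xm[i] -= eps
            obj_p = reward(xp) - lam * regularizer(xp)
            obj_m = reward(xm) - lam * regularizer(xm)
            grad.append((obj_p - obj_m) / (2 * eps))
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x

x_star = generate([0.0, 0.0])
# The objective trades off the reward (pulling toward 2.0) against the
# regularizer (pulling toward 0.0); the optimum is 2 / (1 + lam) ~= 1.818.
```

In this toy setting the trade-off is visible in closed form: the regularization weight `lam` shifts the optimum away from the pure reward maximizer, analogous to how a validity regularizer keeps reward-maximized images on the data manifold.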
Related papers
- Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation [34.08660401151558]
We focus on prompt adaptation, which refines the original prompt into model-preferred prompts to generate desired images. We introduce Prompt Adaptation with GFlowNets (PAG), a novel approach that frames prompt adaptation as a probabilistic inference problem.
arXiv Detail & Related papers (2025-02-17T06:28:53Z) - Visual Autoregressive Modeling for Image Super-Resolution [14.935662351654601]
We propose a novel visual autoregressive modeling framework for image super-resolution (ISR), formulated as next-scale prediction. We collect large-scale data and design a training process to obtain robust generative priors.
arXiv Detail & Related papers (2025-01-31T09:53:47Z) - Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z) - Reward Incremental Learning in Text-to-Image Generation [26.64026346266299]
We present Reward Incremental Distillation (RID), a method that mitigates forgetting with minimal computational overhead.
The experimental results demonstrate the efficacy of RID in achieving consistent, high-quality generation in RIL scenarios.
arXiv Detail & Related papers (2024-11-26T10:54:33Z) - InstantIR: Blind Image Restoration with Instant Generative Reference [10.703499573064537]
We introduce Instant-reference Image Restoration (InstantIR), a novel diffusion-based BIR method.
We first extract a compact representation of the input via a pre-trained vision encoder.
At each generation step, this representation is used to decode current diffusion latent and instantiate it in the generative prior.
The degraded image is then encoded with this reference, providing a robust generation condition.
arXiv Detail & Related papers (2024-10-09T05:15:29Z) - Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models [20.70550870149442]
We introduce Annealed Importance Guidance (AIG), an inference-time regularization inspired by Annealed Importance Sampling.
Our experiments demonstrate the benefits of AIG for Stable Diffusion models, striking the optimal balance between reward optimization and image diversity.
arXiv Detail & Related papers (2024-09-09T16:27:26Z) - RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image Detection [60.960988614701414]
RIGID is a training-free and model-agnostic method for robust AI-generated image detection.
RIGID significantly outperforms existing training-based and training-free detectors.
arXiv Detail & Related papers (2024-05-30T14:49:54Z) - JoReS-Diff: Joint Retinex and Semantic Priors in Diffusion Model for Low-light Image Enhancement [69.6035373784027]
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models.
Previous methods may neglect the importance of sufficiently formulating a task-specific conditioning strategy.
We propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as the additional pre-processing condition.
arXiv Detail & Related papers (2023-12-20T08:05:57Z) - Iterative Token Evaluation and Refinement for Real-World Super-Resolution [77.74289677520508]
Real-world image super-resolution (RWSR) is a long-standing problem as low-quality (LQ) images often have complex and unidentified degradations.
We propose an Iterative Token Evaluation and Refinement framework for RWSR.
We show that ITER is easier to train than Generative Adversarial Networks (GANs) and more efficient than continuous diffusion models.
arXiv Detail & Related papers (2023-12-09T17:07:32Z) - CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation [49.3016007471979]
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks.
However, their widespread adoption is hindered by the high computational cost, which limits their real-time application.
We introduce CoDi, a novel method that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs.
arXiv Detail & Related papers (2023-10-02T17:59:18Z) - Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.