ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion
- URL: http://arxiv.org/abs/2511.18742v2
- Date: Wed, 26 Nov 2025 19:17:47 GMT
- Title: ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion
- Authors: Zhenghan Fang, Jian Zheng, Qiaozi Gao, Xiaofeng Gao, Jeremias Sulam
- Abstract summary: We develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We also develop a new large-scale, open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation.
- Score: 18.25085327318649
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation. Our approach consistently enhances sampling efficiency and human-preference alignment compared to score-based baselines, and achieves results on par with existing state-of-the-art and open-source text-to-image models while requiring lower compute and smaller model size, offering a lightweight yet performant solution for human text-to-image generation.
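The abstract's contrast between forward (explicit, score-based) and backward (implicit, proximal) discretizations can be illustrated on a toy one-dimensional Gaussian, where both updates have closed forms. This is a minimal sketch of the general stability argument, not ProxT2I itself: the paper learns a conditional proximal operator, whereas here the prox of the negative log-density is available in closed form.

```python
def explicit_step(x, eta):
    # Forward (explicit) discretization: x <- x + eta * score(x).
    # For a standard Gaussian, score(x) = d/dx log p(x) = -x.
    return x + eta * (-x)

def proximal_step(x, eta):
    # Backward (implicit) discretization via the proximal operator of
    # f(x) = x^2 / 2, the negative log-density of a standard Gaussian:
    # prox_{eta f}(x) = argmin_z [ f(z) + (z - x)^2 / (2 * eta) ] = x / (1 + eta)
    return x / (1.0 + eta)

x_exp, x_prox = 5.0, 5.0
eta = 2.5  # deliberately large step size
for _ in range(10):
    x_exp = explicit_step(x_exp, eta)
    x_prox = proximal_step(x_prox, eta)

# The explicit map multiplies x by (1 - eta) each step and diverges for eta > 2,
# while the proximal map multiplies by 1 / (1 + eta) and contracts for any eta > 0.
print(abs(x_exp), abs(x_prox))
```

The same contraction-for-any-step-size property is what motivates backward discretizations when one wants few, large sampling steps.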
Related papers
- Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages [6.470160796651034]
We propose a novel framework for training flow-matching text-to-image models into efficient few-step generators that are well aligned with human preferences. We show that TAFS GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences.
arXiv Detail & Related papers (2026-02-02T03:32:00Z) - Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis [97.37770785712475]
We present a generation-based debiasing framework for object detection. Our method significantly narrows the performance gap for underrepresented object groups.
arXiv Detail & Related papers (2025-10-21T02:19:12Z) - Performance Plateaus in Inference-Time Scaling for Text-to-Image Diffusion Without External Models [31.873727540047156]
We apply Best-of-N inference-time scaling to algorithms that optimize the initial noise of a text-to-image diffusion model. We demonstrate that inference-time scaling for text-to-image diffusion models in this setting quickly reaches a performance plateau.
arXiv Detail & Related papers (2025-06-14T21:25:08Z) - Policy Optimized Text-to-Image Pipeline Design [73.9633527029941]
We introduce a novel reinforcement learning-based framework for text-to-image generation. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations. We then implement a two-phase training strategy: initial vocabulary training followed by GRPO-based optimization.
arXiv Detail & Related papers (2025-05-27T17:50:47Z) - Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection [21.677178476653385]
We introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context capabilities. We show that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. It achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80.
arXiv Detail & Related papers (2025-03-15T21:58:12Z) - Fast constrained sampling in pre-trained diffusion models [80.99262780028015]
We propose an algorithm that enables fast, high-quality generation under arbitrary constraints. Our approach produces results that rival or surpass state-of-the-art training-free inference methods.
arXiv Detail & Related papers (2024-10-24T14:52:38Z) - Minority-Focused Text-to-Image Generation via Prompt Optimization [57.319845580050924]
We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. We develop an online prompt optimization framework that encourages the emergence of desired properties during inference. We then tailor this generic framework into a specialized solver that promotes the generation of minority features.
arXiv Detail & Related papers (2024-10-10T11:56:09Z) - Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI). In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z) - Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback. We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z) - Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - The Role of Data Curation in Image Captioning [26.61662352061468]
This paper contributes to this direction by actively curating difficult samples in datasets without increasing the total number of samples.
Experiments on the Flickr30K and COCO datasets with the BLIP and BEiT-3 models demonstrate that these curation methods do indeed yield improved image captioning models.
arXiv Detail & Related papers (2023-05-05T15:16:07Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.