Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
- URL: http://arxiv.org/abs/2509.25771v1
- Date: Tue, 30 Sep 2025 04:32:34 GMT
- Title: Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
- Authors: Jia Jun Cheng Xian, Muchen Li, Haotian Yang, Xin Tao, Pengfei Wan, Leonid Sigal, Renjie Liao,
- Abstract summary: We introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models.<n>TPO works by training the model to prefer matched prompts over mismatched prompts.<n>Our framework is general and compatible with existing preference-based algorithms.
- Score: 36.42060582800515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our Open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
Related papers
- Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
SoftREPA is a lightweight contrastive fine-tuning strategy that leverages soft text tokens for representation alignment.<n>Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
arXiv Detail & Related papers (2025-03-11T10:14:22Z) - Dual Caption Preference Optimization for Diffusion Models [53.218293277964165]
We introduce Dual Caption Preference Optimization (DCPO) to improve text-to-image diffusion models.<n>DCPO assigns two distinct captions to each preference pair, which reinforces the learning signal.<n>Experiments show that DCPO significantly improves image quality and relevance to prompts.
arXiv Detail & Related papers (2025-02-09T20:34:43Z) - Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z) - FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation [19.65838242227773]
This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner.<n>Our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band.
arXiv Detail & Related papers (2024-08-02T04:13:38Z) - Information Theoretic Text-to-Image Alignment [49.396917351264655]
Mutual Information (MI) is used to guide model alignment.<n>Our method uses self-supervised fine-tuning and relies on a point-wise (MI) estimation between prompts and images.<n>Our analysis indicates that our method is superior to the state-of-the-art, yet it only requires the pre-trained denoising network of the T2I model itself to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z) - Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models [67.68871360210208]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency.<n>We propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between fine-tuning and pretrained models.<n>We show that our approach achieves better prompt fidelity and subject fidelity than those post-optimized for merging regular fine-tuned models.
arXiv Detail & Related papers (2024-02-19T09:52:41Z) - Reinforcement Learning from Diffusion Feedback: Q* for Image Search [2.5835347022640254]
We present two models for image generation using model-agnostic learning.
RLDF is a singular approach for visual imitation through prior-preserving reward function guidance.
It generates high-quality images over varied domains showcasing class-consistency and strong visual diversity.
arXiv Detail & Related papers (2023-11-27T09:20:12Z) - If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based
Text-to-Image Generation by Selection [53.320946030761796]
diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z) - GLIDE: Towards Photorealistic Image Generation and Editing with
Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z) - Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.