Related papers: AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

URL: http://arxiv.org/abs/2403.13352v3
Date: Wed, 3 Apr 2024 13:08:55 GMT
Title: AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Authors: Jingkun An, Yinghao Zhu, Zongjian Li, Haoran Feng, Bohua Chen, Yemin Shi, Chengwei Pan,
Abstract summary: Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. We introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques.
Score: 4.054100650064423
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in prompt-following ability, image quality, and the lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLM) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and SDXL, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques.

Related papers

CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation [52.0601996237501]
Chain-of-Frame (CoF) reasoning enables frame-by-frame visual inference. CoF-T2I integrates CoF reasoning into text-to-image (T2I) generation via progressive visual refinement. Experiments show that CoF-T2I significantly outperforms the base video model.
arXiv Detail & Related papers (2026-01-15T04:33:06Z)
ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion [18.25085327318649]
We develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation.
arXiv Detail & Related papers (2025-11-24T04:10:53Z)
Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs [36.42060582800515]
We introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models. TPO works by training the model to prefer matched prompts over mismatched prompts. Our framework is general and compatible with existing preference-based algorithms.
arXiv Detail & Related papers (2025-09-30T04:32:34Z)
Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [66.73899356886652]
We build an image tokenizer directly atop pre-trained vision foundation models. Our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality. It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on ImageNet benchmarks.
arXiv Detail & Related papers (2025-07-11T09:32:45Z)
RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment [51.85242063075333]
Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation. Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt. We introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation.
arXiv Detail & Related papers (2025-03-18T15:41:45Z)
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [33.519892081718716]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models. We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation [29.176750442205325]
In this study, we contribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. We introduce two new methods to evaluate the image-text alignment capabilities of T2I models.
arXiv Detail & Related papers (2024-12-24T04:08:25Z)
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model. We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
Scalable Ranked Preference Optimization for Text-to-Image Generation [76.16285931871948]
We investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. The preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process. We introduce RankDPO to enhance DPO-based methods using the ranking feedback.
arXiv Detail & Related papers (2024-10-23T16:42:56Z)
FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. We present FrameBridge, taking the given static image as the prior of the video target and establishing a tractable bridge model between them. We propose two techniques, SNR-Aware Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models, respectively.
arXiv Detail & Related papers (2024-10-20T12:10:24Z)
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models. It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity.
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaled properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z)
Improving Text-to-Image Consistency via Automatic Prompt Optimization [26.2587505265501]
We introduce a T2I optimization-by-prompting framework, OPT2I, to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score.
arXiv Detail & Related papers (2024-03-26T15:42:01Z)
Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency. We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI). In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion). Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity. We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors. I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)
Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation [15.411325887412413]
This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM). FSA-CDM introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. Experiments are conducted on four benchmark datasets from different domains.
arXiv Detail & Related papers (2023-08-02T13:43:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.