Related papers: Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training

URL: http://arxiv.org/abs/2312.16204v2
Date: Fri, 5 Jul 2024 15:59:24 GMT
Title: Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training
Authors: Xinyan Chen, Jiaxin Ge, Tianjun Zhang, Jiaming Liu, Shanghang Zhang
Abstract summary: Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling. We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
Score: 33.51524424536508
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have shown impressive performance in many domains, including image generation, time series prediction, and reinforcement learning. The algorithm demonstrates superior performance over the traditional GAN and transformer-based methods. However, the model's capability to follow natural language instructions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. It has been an important research area to enhance such capability. Prior works have shown that using Reinforcement Learning can effectively train diffusion models to enhance fidelity on specific objectives. However, existing RL methods require collecting a large amount of data to train an effective reward model. They also don't receive feedback when the generated image is incorrect. In this work, we propose Iterative Prompt Relabeling (IPR), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling. IPR first samples a batch of images conditioned on the text then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations. With IPR, we improved up to 15.22% (absolute improvement) on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods.
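The sample-then-relabel step described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Prompt` and `ToyImage` types, the `toy_sampler` and `toy_classifier` stand-ins (no real diffusion model or detector is involved), and the fixed set of four spatial relations. The sketch only shows how prompts of unmatched text-image pairs might be rewritten from classifier feedback so that every pair becomes usable for training.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical relation vocabulary, chosen only for this toy example.
RELATIONS = ("left of", "right of", "above", "below")


@dataclass(frozen=True)
class Prompt:
    obj_a: str
    relation: str
    obj_b: str


@dataclass(frozen=True)
class ToyImage:
    """Toy stand-in for an image: records the relation actually depicted."""
    obj_a: str
    relation: str
    obj_b: str


def toy_sampler(prompt: Prompt, rng: random.Random, accuracy: float = 0.5) -> ToyImage:
    """Stand-in for a text-to-image model that follows the prompt
    with probability `accuracy` and otherwise depicts another relation."""
    if rng.random() < accuracy:
        relation = prompt.relation
    else:
        relation = rng.choice([r for r in RELATIONS if r != prompt.relation])
    return ToyImage(prompt.obj_a, relation, prompt.obj_b)


def toy_classifier(image: ToyImage) -> str:
    """Stand-in for the feedback classifier: reports the depicted relation."""
    return image.relation


def relabel_round(
    prompts: List[Prompt],
    sampler: Callable[[Prompt], ToyImage],
    classifier: Callable[[ToyImage], str],
) -> List[Tuple[ToyImage, Prompt, bool]]:
    """One sampling pass: keep matched pairs, relabel the prompt of unmatched ones."""
    pairs = []
    for prompt in prompts:
        image = sampler(prompt)
        detected = classifier(image)
        if detected == prompt.relation:
            pairs.append((image, prompt, False))
        else:
            # The image is kept; its prompt is rewritten to describe what was generated.
            relabeled = Prompt(prompt.obj_a, detected, prompt.obj_b)
            pairs.append((image, relabeled, True))
    return pairs


rng = random.Random(0)
prompts = [Prompt("dog", "left of", "chair"), Prompt("cup", "above", "book")] * 4
pairs = relabel_round(prompts, lambda p: toy_sampler(p, rng), toy_classifier)
```

In an iterative setup, the resulting pairs would be used to fine-tune the generator before sampling again; that training step is omitted here because the abstract does not specify it.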

Related papers

FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL [78.59912944698992]
We propose FocusDiff to enhance fine-grained text-image semantic alignment. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.
arXiv Detail & Related papers (2025-06-05T18:36:33Z)
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences [28.683767105094393]
We propose an alternative approach that leverages cycle consistency as a supervisory signal. We map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs.
arXiv Detail & Related papers (2025-06-02T17:42:58Z)
VSC: Visual Search Compositional Text-to-Image Diffusion Model [15.682990658945682]
We introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. Our approaches outperform existing compositional text-to-image diffusion models on the benchmark T2I CompBench, achieving better image quality, evaluated by humans, and emerging robustness under scaling number of binding pairs in the prompt.
arXiv Detail & Related papers (2025-05-02T08:31:43Z)
Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
We introduce a lightweight contrastive fine tuning strategy called SoftREPA that uses soft text tokens. Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
arXiv Detail & Related papers (2025-03-11T10:14:22Z)
Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption. We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts. Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images. We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
arXiv Detail & Related papers (2023-06-29T17:08:16Z)
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models [94.25020178662392]
Text-to-image (T2I) research has grown explosively in the past year. One pain point persists: the text prompt engineering, and searching high-quality text prompts for customized results is more art than science. In this paper, we take "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users.
arXiv Detail & Related papers (2023-05-25T16:30:07Z)
If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt. We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts. We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
Bi-directional Training for Composed Image Retrieval via Text Prompt Learning [46.60334745348141]
Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text. We propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures. Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model.
arXiv Detail & Related papers (2023-03-29T11:37:41Z)
Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models. LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space. It significantly outperforms a CLIP model with +4.9% and +4.3% absolute Recall@1 improvement.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images [4.544151613454639]
We argue that depending on the application, users of image retrieval systems have different and changing similarity notions. We present Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting. InDiReCT is a model for LanZ-DML on images that exclusively uses a few text prompts for training.
arXiv Detail & Related papers (2022-11-23T08:09:50Z)
Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously. The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
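The hinge-based triplet ranking mentioned in the last entry can be written down compactly. The following is a generic sketch of that loss family, not that paper's implementation: the use of cosine similarity, the margin value of 0.2, and the function names are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hinge_triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # assumed value, for illustration only
) -> float:
    """Zero when the matching pair beats the mismatched pair by at least `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


# An image embedding with a matching caption and a mismatched caption.
image = [1.0, 0.0]
matching_text = [1.0, 0.0]
mismatched_text = [0.0, 1.0]

loss_good = hinge_triplet_loss(image, matching_text, mismatched_text)
loss_bad = hinge_triplet_loss(image, mismatched_text, matching_text)
```

The loss is zero once the positive pair is ranked above the negative pair by the margin, so only violating triplets contribute to training.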

This list is automatically generated from the titles and abstracts of the papers on this site.