Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image
Alignment with Iterative VQA Feedback
- URL: http://arxiv.org/abs/2307.04749v2
- Date: Wed, 6 Dec 2023 00:45:08 GMT
- Title: Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image
Alignment with Iterative VQA Feedback
- Authors: Jaskirat Singh and Liang Zheng
- Abstract summary: We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
- Score: 20.78162037954646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of text-conditioned image generation has made unparalleled progress
with the recent advent of latent diffusion models. While remarkable, as the
complexity of the given text input increases, state-of-the-art diffusion models
may still fail to generate images that accurately convey the semantics of
the given prompt. Furthermore, it has been observed that such misalignments are
often left undetected by pretrained multi-modal models such as CLIP. To address
these problems, in this paper we explore a simple yet effective decompositional
approach towards both evaluation and improvement of text-to-image alignment. In
particular, we first introduce a Decompositional-Alignment-Score which, given a
complex prompt, decomposes it into a set of disjoint assertions. The alignment
of each assertion with generated images is then measured using a VQA model.
Finally, alignment scores for different assertions are combined a posteriori to
give the final text-to-image alignment score. Experimental analysis reveals
that the proposed alignment metric shows significantly higher correlation with
human ratings than traditional CLIP and BLIP scores. Furthermore, we also
find that the assertion-level alignment scores provide useful feedback which
can then be used in a simple iterative procedure to gradually increase the
expression of different assertions in the final image outputs. Human user
studies indicate that the proposed approach surpasses previous state-of-the-art
by 8.7% in overall text-to-image alignment accuracy. Project page for our paper
is available at https://1jsingh.github.io/divide-evaluate-and-refine
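As a concrete reading of the pipeline above, the following is a minimal sketch of the decompose-evaluate-combine loop and the feedback-driven refinement it enables. The helpers `decompose_prompt`, `vqa_yes_probability`, and `regenerate` are hypothetical stand-ins (an LLM-based decomposer, any yes/no VQA model, and a generator that accepts an emphasis hint), and the min-based combination rule is an illustrative assumption, not the paper's exact formula.

```python
# Illustrative sketch of decompositional alignment scoring and iterative
# refinement; all three callables are hypothetical stand-ins.
from typing import Callable, List


def alignment_score(
    image: object,
    prompt: str,
    decompose_prompt: Callable[[str], List[str]],
    vqa_yes_probability: Callable[[object, str], float],
) -> float:
    """Score image-text alignment one assertion at a time."""
    # 1. Decompose the complex prompt into disjoint assertions, e.g.
    #    "a red book and a yellow vase" ->
    #    ["there is a red book", "there is a yellow vase"].
    assertions = decompose_prompt(prompt)

    # 2. Check each assertion against the image with a VQA model,
    #    phrased as a yes/no question.
    scores = [
        vqa_yes_probability(image, f"Is it true that {a}?")
        for a in assertions
    ]

    # 3. Combine per-assertion scores a posteriori. The minimum
    #    penalizes images that drop any single assertion; a mean is
    #    a softer alternative. (Combination rule assumed here.)
    return min(scores) if scores else 0.0


def iterative_refinement(
    image: object,
    prompt: str,
    decompose_prompt: Callable[[str], List[str]],
    vqa_yes_probability: Callable[[object, str], float],
    regenerate: Callable[..., object],
    rounds: int = 3,
    threshold: float = 0.9,
) -> object:
    """Regenerate, strengthening the weakest assertion each round."""
    for _ in range(rounds):
        assertions = decompose_prompt(prompt)
        scores = {
            a: vqa_yes_probability(image, f"Is it true that {a}?")
            for a in assertions
        }
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= threshold:
            break  # every assertion is already well expressed
        # `regenerate` is assumed to accept a hint about which assertion
        # to strengthen (e.g. via prompt weighting or attention scaling).
        image = regenerate(prompt, emphasize=weakest)
    return image
```

The per-assertion scores are what make the feedback loop possible: a single scalar such as a CLIP or BLIP similarity only says that an image is misaligned, while the decomposed scores say which part of the prompt is missing.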
Related papers
- Extract Free Dense Misalignment from CLIP [7.0247398611254175]
This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP.
We revamp the gradient-based attribution computation method, enabling the negative gradients of individual text tokens to indicate misalignment.
Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models.
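A rough sketch of the gradient-times-input style of token attribution that this summary suggests, using the Hugging Face `transformers` CLIP classes. The attribution rule and the sign-based misalignment flag are assumptions in the spirit of the summary, not necessarily CLIP4DM's exact method.

```python
# Attribute a CLIP image-text score back to individual text tokens;
# tokens with negative attribution are flagged as possible misalignments.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

captured = {}

def keep_embeddings(module, inputs, output):
    # Retain gradients on the token embeddings so the score can be
    # attributed to each text token after backward().
    output.retain_grad()
    captured["emb"] = output

hook = model.text_model.embeddings.token_embedding.register_forward_hook(
    keep_embeddings)

prompt = "a red book and a yellow vase"
image = Image.open("generated.png")  # any candidate image
inputs = processor(text=[prompt], images=image, return_tensors="pt")

score = model(**inputs).logits_per_image[0, 0]
score.backward()
hook.remove()

emb = captured["emb"]                      # (1, seq_len, dim)
attribution = (emb.grad * emb).sum(-1)[0]  # grad x input, per token
tokens = processor.tokenizer.convert_ids_to_tokens(
    inputs["input_ids"][0].tolist())

for tok, a in zip(tokens, attribution.tolist()):
    flag = "<- possible misalignment" if a < 0 else ""
    print(f"{tok:>15s} {a:+.4f} {flag}")
```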
arXiv Detail & Related papers (2024-12-24T12:51:05Z)
- DECOR: Decomposition and Projection of Text Embeddings for Text-to-Image Customization [15.920735314050296]
This study decomposes the text embedding matrix and conducts a component analysis to understand the embedding space geometry.
We propose DECOR, which projects text embeddings onto a vector space orthogonal to undesired token vectors.
Experimental results demonstrate that DECOR outperforms state-of-the-art customization models.
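The projection step reads as a small piece of linear algebra; the sketch below assumes "projects onto the orthogonal complement of the undesired token vectors" as the intended operation, which may differ from DECOR's exact procedure.

```python
# Project text embeddings onto the subspace orthogonal to a set of
# undesired token vectors (assumed reading of the summary above).
import torch


def project_out(embeddings: torch.Tensor, undesired: torch.Tensor) -> torch.Tensor:
    """Remove the components of `embeddings` lying in span(undesired).

    embeddings: (seq_len, dim) text embedding matrix
    undesired:  (k, dim) token vectors whose influence should be removed
    """
    # Orthonormalize the undesired directions via reduced QR.
    q, _ = torch.linalg.qr(undesired.T)  # (dim, k), orthonormal columns
    # P = I - Q Q^T projects onto the orthogonal complement of span(undesired).
    return embeddings - embeddings @ q @ q.T


# Toy usage: scrub two directions from a random 77x768 embedding matrix.
emb = torch.randn(77, 768)
bad = torch.randn(2, 768)
clean = project_out(emb, bad)
print(torch.allclose(clean @ bad.T, torch.zeros(77, 2), atol=1e-3))  # True
```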
arXiv Detail & Related papers (2024-12-12T10:59:44Z)
- Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback [5.415802995586328]
Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models.
We propose an efficient fine-tuning method with specific reward objectives, comprising three stages.
Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity.
arXiv Detail & Related papers (2024-11-28T09:56:28Z)
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images.
A higher likelihood indicates better perceptual quality and better text-image alignment.
It can successfully assess the generation ability of these models with as few as a hundred samples.
arXiv Detail & Related papers (2023-08-16T17:26:47Z)
- Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models [35.02969643344228]
We present a training-free approach called Text-Anchored Score Composition (TASC) to improve the controllability of existing models.
We further propose an attention operation that realigns the independently calculated scores via a cross-attention mechanism, avoiding new conflicts when combining them.
arXiv Detail & Related papers (2023-06-26T03:48:15Z)
- Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
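One common instantiation of learning from human feedback is a reward-weighted generative loss; the sketch below assumes that setup for illustration. `model.loss(...)` and `reward_model(...)` are hypothetical stand-ins for a per-example diffusion training loss and a learned human-preference reward, and the softmax weighting is one common choice rather than this paper's exact objective.

```python
# Hedged sketch: one reward-weighted fine-tuning step for a conditional
# image generator. `model.loss` and `reward_model` are hypothetical.
import torch


def reward_weighted_step(model, reward_model, optimizer, images, prompts):
    with torch.no_grad():
        # Scalar human-preference reward per (image, prompt) pair.
        rewards = reward_model(images, prompts)      # (batch,)
        weights = torch.softmax(rewards, dim=0)      # normalize to sum to 1

    # Per-example generative loss (e.g. a denoising loss for diffusion).
    losses = model.loss(images, prompts, reduction="none")  # (batch,)

    # Upweight examples that humans prefer, downweight the rest.
    loss = (weights * losses).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```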
arXiv Detail & Related papers (2023-02-23T17:34:53Z)
- High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
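The ensemble idea reduces to routing each denoising step to a stage-specialized expert, which keeps inference cost at a single model per step. The toy sketch below assumes two experts and one timestep threshold, a simplification of eDiffi's actual expert schedule.

```python
# Toy sketch: route each diffusion step to a stage-specialized denoiser.
# Two experts and a single switch point are illustrative assumptions.
import torch
import torch.nn as nn


class ExpertEnsembleDenoiser(nn.Module):
    def __init__(self, early_expert: nn.Module, late_expert: nn.Module,
                 switch_t: float = 0.5):
        super().__init__()
        self.early_expert = early_expert  # high-noise steps, where text
                                          # conditioning shapes the layout
        self.late_expert = late_expert    # low-noise steps, visual detail
        self.switch_t = switch_t

    def forward(self, x: torch.Tensor, t: torch.Tensor,
                cond: torch.Tensor) -> torch.Tensor:
        # t normalized to [0, 1], with 1 = pure noise. Exactly one expert
        # runs per step, so inference cost matches a single model.
        if float(t.mean()) > self.switch_t:
            return self.early_expert(x, t, cond)
        return self.late_expert(x, t, cond)
```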
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.