Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback
- URL: http://arxiv.org/abs/2307.04749v2
- Date: Wed, 6 Dec 2023 00:45:08 GMT
- Title: Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback
- Authors: Jaskirat Singh and Liang Zheng
- Abstract summary: We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy.
- Score: 20.78162037954646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of text-conditioned image generation has made unparalleled progress
with the recent advent of latent diffusion models. While remarkable, as the
complexity of the given text input increases, state-of-the-art diffusion models
may still fail to generate images that accurately convey the semantics of
the given prompt. Furthermore, it has been observed that such misalignments are
often left undetected by pretrained multi-modal models such as CLIP. To address
these problems, in this paper we explore a simple yet effective decompositional
approach towards both evaluation and improvement of text-to-image alignment. In
particular, we first introduce a Decompositional-Alignment-Score which, given a
complex prompt, decomposes it into a set of disjoint assertions. The alignment
of each assertion with generated images is then measured using a VQA model.
Finally, alignment scores for different assertions are combined a posteriori to
give the final text-to-image alignment score. Experimental analysis reveals
that the proposed alignment metric shows significantly higher correlation with
human ratings than traditional CLIP and BLIP scores. Furthermore, we also
find that the assertion-level alignment scores provide useful feedback which
can then be used in a simple iterative procedure to gradually increase the
expression of different assertions in the final image outputs. Human user
studies indicate that the proposed approach surpasses previous state-of-the-art
by 8.7% in overall text-to-image alignment accuracy. Project page for our paper
is available at https://1jsingh.github.io/divide-evaluate-and-refine
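To make the pipeline concrete, below is a minimal sketch of the decompose-evaluate-refine loop described in the abstract. The naive conjunct split, the yes/no question template, the mean combination rule, and the generate(prompt, emphasize=...) sampler hook are illustrative assumptions, not the authors' implementation.

    # Sketch of the Decompositional-Alignment-Score and iterative refinement.
    # All specifics below (decomposition, question template, combination rule,
    # sampler interface) are assumptions for illustration only.
    from typing import Callable, List, Tuple

    def decompose_prompt(prompt: str) -> List[str]:
        """Split a complex prompt into disjoint assertions (naive stand-in)."""
        parts = prompt.replace(",", " and ").split(" and ")
        return [p.strip() for p in parts if p.strip()]

    def alignment_score(
        image: object,
        prompt: str,
        vqa_yes_prob: Callable[[object, str], float],  # P("yes" | image, question)
    ) -> Tuple[float, List[Tuple[str, float]]]:
        """Score each assertion with a VQA model, then combine a posteriori."""
        per_assertion = []
        for assertion in decompose_prompt(prompt):
            question = f"Does the image show {assertion}?"  # yes/no VQA query
            per_assertion.append((assertion, vqa_yes_prob(image, question)))
        overall = sum(s for _, s in per_assertion) / max(len(per_assertion), 1)
        return overall, per_assertion

    def iterative_refine(
        generate: Callable[..., object],  # hypothetical diffusion-sampler hook
        vqa_yes_prob: Callable[[object, str], float],
        prompt: str,
        steps: int = 5,
        threshold: float = 0.8,
    ) -> object:
        """Re-generate while emphasizing the worst-scoring assertion."""
        image = generate(prompt)
        for _ in range(steps):
            _, per_assertion = alignment_score(image, prompt, vqa_yes_prob)
            worst, score = min(per_assertion, key=lambda t: t[1])
            if score >= threshold:  # every assertion sufficiently expressed
                break
            image = generate(prompt, emphasize=worst)
        return image

In practice, vqa_yes_prob could wrap an off-the-shelf VQA model such as BLIP, returning the probability assigned to the answer "yes", and generate could expose per-phrase emphasis through prompt weighting or cross-attention scaling.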
Related papers
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z)
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
Compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts.
We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts.
Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
arXiv Detail & Related papers (2023-10-10T05:09:05Z)
- Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images.
A higher likelihood indicates better perceptual quality and better text-image alignment.
The metric can assess the generation ability of text-to-image models with as few as a hundred samples.
arXiv Detail & Related papers (2023-08-16T17:26:47Z)
- Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models [35.02969643344228]
We present a training-free approach called Text-Anchored Score Composition (TASC) to improve the controllability of existing models.
We also propose an attention operation that realigns the independently estimated per-condition scores via a cross-attention mechanism, avoiding new conflicts when combining them back.
arXiv Detail & Related papers (2023-06-26T03:48:15Z)
- Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
arXiv Detail & Related papers (2023-02-23T17:34:53Z)
- TIER: Text-Image Entropy Regularization for CLIP-style models [0.0]
In CLIP-style models, text-token embeddings should have high similarity to only a small number of image-patch embeddings.
We formalize this observation using a novel regularization scheme that penalizes the entropy of the text-token to image-patch similarity scores.
We show that the proposed regularization scheme shrinks the text-token to image-patch similarity scores towards zero, achieving the desired sparsity (a minimal sketch of such a penalty follows this list).
arXiv Detail & Related papers (2022-12-13T16:29:13Z)
- High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
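As a companion to the TIER entry above, here is a minimal sketch of an entropy penalty on text-token to image-patch similarity scores, written in PyTorch; the tensor shapes, temperature, and softmax normalization are assumptions rather than the paper's code.

    import torch
    import torch.nn.functional as F

    def tier_entropy_penalty(text_emb: torch.Tensor,   # (B, T, D) text-token embeddings
                             patch_emb: torch.Tensor,  # (B, P, D) image-patch embeddings
                             tau: float = 0.07) -> torch.Tensor:
        # Cosine-style similarity between every text token and every image patch.
        text_emb = F.normalize(text_emb, dim=-1)
        patch_emb = F.normalize(patch_emb, dim=-1)
        sim = text_emb @ patch_emb.transpose(1, 2) / tau        # (B, T, P)
        probs = sim.softmax(dim=-1)       # per-token distribution over patches
        # Penalizing entropy pushes each token to match only a few patches.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (B, T)
        return entropy.mean()  # add, suitably weighted, to the contrastive loss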