Related papers: Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

URL: http://arxiv.org/abs/2410.00905v1
Date: Tue, 1 Oct 2024 17:50:17 GMT
Title: Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Authors: Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh,
Abstract summary: We introduce a model designed to improve the prediction of image-text alignment. Our approach focuses on generating high-quality training datasets for the alignment task. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
Score: 76.31530836622694
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment. Project page: \url{https://yuheng-li.github.io/LLaVA-score/}

Related papers

Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation [26.186038156155522]
We propose a novel training method for Vision Language Models (VLMs) to generate not only image scores but also corresponding justifications in natural language.<n>Our method enables self-training, utilizing the VLM's generated text without relying on external data or models.
arXiv Detail & Related papers (2025-06-03T10:04:19Z)
Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
We introduce a lightweight contrastive fine tuning strategy called SoftREPA that uses soft text tokens. Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
arXiv Detail & Related papers (2025-03-11T10:14:22Z)
Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback [5.415802995586328]
Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. We propose an efficient fine-turning method with specific reward objectives, including three stages. Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity.
arXiv Detail & Related papers (2024-11-28T09:56:28Z)
Debiasing Vison-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases. To address the limitations, we propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
Information Theoretic Text-to-Image Alignment [49.396917351264655]
Mutual Information (MI) is used to guide model alignment. Our method uses self-supervised fine-tuning and relies on a point-wise (MI) estimation between prompts and images. Our analysis indicates that our method is superior to the state-of-the-art, yet it only requires the pre-trained denoising network of the T2I model itself to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanation of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image. We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z)
Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts. We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
arXiv Detail & Related papers (2023-10-10T05:09:05Z)
Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions. We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions. We achieve similar-quality visual results with models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff. In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt. The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text. Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images. We evaluate our method extensively, showing that the generated images are: (i) more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training. We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space. We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
Current text-to-image models often generate images that are inadequately aligned with text prompts. We propose a fine-tuning method for aligning such models using human feedback. Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
arXiv Detail & Related papers (2023-02-23T17:34:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.