Information Theoretic Text-to-Image Alignment
- URL: http://arxiv.org/abs/2405.20759v3
- Date: Tue, 11 Feb 2025 07:27:41 GMT
- Title: Information Theoretic Text-to-Image Alignment
- Authors: Chao Wang, Giulio Franzese, Alessandro Finamore, Massimo Gallo, Pietro Michiardi
- Abstract summary: Mutual Information (MI) is used to guide model alignment. Our method uses self-supervised fine-tuning and relies on a point-wise MI estimation between prompts and images. Our analysis indicates that our method is superior to the state-of-the-art, yet it only requires the pre-trained denoising network of the T2I model itself to estimate MI.
- Score: 49.396917351264655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models for Text-to-Image (T2I) conditional generation have recently achieved tremendous success. Yet, aligning these models with users' intentions still involves a laborious trial-and-error process, and this challenging alignment problem has attracted considerable attention from the research community. In this work, instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-language models, we use Mutual Information (MI) to guide model alignment. In brief, our method uses self-supervised fine-tuning and relies on a point-wise MI estimation between prompts and images to create a synthetic fine-tuning set for improving model alignment. Our analysis indicates that our method is superior to the state-of-the-art, yet it only requires the pre-trained denoising network of the T2I model itself to estimate MI, and a simple fine-tuning strategy that improves alignment while maintaining image quality. Code available at https://github.com/Chao0511/mitune.
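A rough sketch of the core idea follows, assuming the point-wise MI between a prompt and an image is approximated by the gap between the unconditional and conditional denoising errors of the pre-trained denoiser; the names (`unet`, `scheduler`, `prompt_emb`, `null_emb`) are generic latent-diffusion placeholders, not the repository's actual API.

```python
import torch

@torch.no_grad()
def pointwise_mi(unet, scheduler, latents, prompt_emb, null_emb, n_samples=32):
    """Rough point-wise MI estimate between a prompt and an image latent.

    Sketch only: approximates MI as the average gap between the unconditional
    and conditional denoising errors over random timesteps, using the
    pre-trained denoiser itself. All components are placeholders for a
    standard latent-diffusion T2I pipeline.
    """
    total = 0.0
    for _ in range(n_samples):
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=latents.device)
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)
        eps_cond = unet(noisy, t, encoder_hidden_states=prompt_emb).sample
        eps_uncond = unet(noisy, t, encoder_hidden_states=null_emb).sample
        # A larger drop in denoising error when conditioning on the prompt
        # indicates stronger prompt-image dependence (higher point-wise MI).
        total += ((noise - eps_uncond) ** 2).mean() - ((noise - eps_cond) ** 2).mean()
    return (total / n_samples).item()
```

Under this reading, candidate images generated for a prompt can be ranked by the estimate and the best-aligned pairs kept as the synthetic fine-tuning set.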
Related papers
- RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment [51.85242063075333]
Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation.
Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt.
We introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation.
arXiv Detail & Related papers (2025-03-18T15:41:45Z)
- Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
We introduce a lightweight contrastive fine-tuning strategy called SoftREPA that uses soft text tokens.
Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
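A minimal sketch of that idea, under the assumption that the soft tokens are prepended to the text embedding and trained with an InfoNCE-style contrastive loss (a standard lower bound on mutual information); the class and loss form here are illustrative, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

class SoftTokens(torch.nn.Module):
    """A few learnable 'soft' tokens prepended to the text embedding (sketch)."""
    def __init__(self, n_tokens=4, dim=768):
        super().__init__()
        self.tokens = torch.nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, text_emb):                     # text_emb: (B, L, D)
        soft = self.tokens.unsqueeze(0).expand(text_emb.size(0), -1, -1)
        return torch.cat([soft, text_emb], dim=1)    # (B, n_tokens + L, D)

def info_nce(text_repr, image_repr, temperature=0.07):
    """InfoNCE-style loss: matched text-image pairs sit on the diagonal,
    so minimizing it increases the MI between the two representations."""
    text_repr = F.normalize(text_repr, dim=-1)
    image_repr = F.normalize(image_repr, dim=-1)
    logits = text_repr @ image_repr.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```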
arXiv Detail & Related papers (2025-03-11T10:14:22Z)
- EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation [29.176750442205325]
In this study, we contribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks.
We introduce two new methods to evaluate the image-text alignment capabilities of T2I models.
arXiv Detail & Related papers (2024-12-24T04:08:25Z)
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
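A small sketch of that ranking step, with `alignment_score` standing in for the paper's trained alignment predictor (the name and signature are assumptions for illustration):

```python
def rank_by_alignment(prompt, images, alignment_score):
    """Sort candidate generations by a predicted text-image alignment score,
    best-aligned first. `alignment_score(prompt, image) -> float` is a
    hypothetical stand-in for the trained predictor."""
    scored = [(alignment_score(prompt, img), img) for img in images]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [img for _, img in scored]
```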
arXiv Detail & Related papers (2024-10-01T17:50:17Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing translations in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models [67.68871360210208]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency.
We propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between fine-tuning and pretrained models.
We show that our approach achieves better prompt fidelity and subject fidelity than those post-optimized for merging regular fine-tuned models.
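One simple way to "control the deviation" from the pretrained model is to regularize the fine-tuned denoiser toward the frozen original; the sketch below illustrates that generic idea only and is not the paper's exact Direct Consistency Optimization objective.

```python
import torch
import torch.nn.functional as F

def deviation_controlled_loss(unet, unet_pretrained, noisy, t, text_emb, noise, lam=1.0):
    """Generic sketch: a diffusion fine-tuning loss plus a penalty that keeps
    the fine-tuned denoiser close to the frozen pretrained one. Illustrative
    only; NOT the exact DCO objective from the paper."""
    eps = unet(noisy, t, encoder_hidden_states=text_emb).sample
    with torch.no_grad():
        eps_ref = unet_pretrained(noisy, t, encoder_hidden_states=text_emb).sample
    denoising = F.mse_loss(eps, noise)       # standard denoising objective
    deviation = F.mse_loss(eps, eps_ref)     # stay close to the pretrained model
    return denoising + lam * deviation
```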
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
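A rough sketch of the prompt-learning idea, assuming learnable prompt embeddings are optimized against a differentiable text-image similarity; `generate` and `clip_similarity` are hypothetical stand-ins, not the paper's components.

```python
import torch

def learn_prompt_embeddings(generate, clip_similarity, text, steps=100, lr=1e-2, n_tokens=8, dim=768):
    """Sketch of prompt learning: optimize a few prompt embeddings so the
    generated image better matches the input text. `generate` (assumed
    differentiable) and `clip_similarity` are placeholders."""
    prompt_emb = torch.nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
    opt = torch.optim.Adam([prompt_emb], lr=lr)
    for _ in range(steps):
        image = generate(text, extra_prompt_embeddings=prompt_emb)
        loss = -clip_similarity(text, image)   # maximize text-image agreement
        opt.zero_grad()
        loss.backward()
        opt.step()
    return prompt_emb.detach()
```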
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
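For context, the "is-a" relation can be read directly off the WordNet hierarchy, e.g. with nltk; the probe below is illustrative only and is not the paper's metric.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def hyponyms_of(word):
    """Collect direct hyponyms ('is-a' children) of a word's first noun sense."""
    synset = wn.synsets(word, pos=wn.NOUN)[0]
    return sorted({lemma.name() for h in synset.hyponyms() for lemma in h.lemmas()})

# E.g., prompting a T2I model with "an animal" and checking whether the
# generated images cover hyponyms such as "dog" or "bird" probes whether
# the model understands the hypernym. (Illustrative probe, not the paper's metric.)
print(hyponyms_of("animal")[:10])
```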
arXiv Detail & Related papers (2023-10-13T16:53:25Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that can see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Zero-Shot Image Harmonization with Generative Model Prior [22.984119094424056]
We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images.
We introduce a fully modularized framework inspired by human behavior.
We present compelling visual results across diverse scenes and objects, along with a user study validating our approach.
arXiv Detail & Related papers (2023-07-17T00:56:21Z)
- Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models [103.61066310897928]
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt.
We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt.
We introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images.
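A hedged sketch of one GSN-style intervention step: nudge the latent by a gradient step so the least-attended subject token gains cross-attention mass. `get_cross_attention` is a placeholder, and details differ from the actual Attend-and-Excite implementation.

```python
import torch

def nurse_latent(latent, t, subject_token_ids, get_cross_attention, step_size=0.1):
    """One Generative Semantic Nursing step (sketch): push the latent so the
    most neglected subject token receives more cross-attention.
    `get_cross_attention(latent, t)` is a placeholder returning per-token
    attention maps from the denoiser's cross-attention layers."""
    latent = latent.detach().requires_grad_(True)
    attn = get_cross_attention(latent, t)            # (tokens, H, W) attention maps
    per_token_max = torch.stack([attn[i].max() for i in subject_token_ids])
    loss = 1.0 - per_token_max.min()                 # focus on the most neglected subject
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - step_size * grad).detach()
```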
arXiv Detail & Related papers (2023-01-31T18:10:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.