TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization
- URL: http://arxiv.org/abs/2412.18185v3
- Date: Sat, 25 Jan 2025 02:19:33 GMT
- Title: TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization
- Authors: Yucong Luo, Mingyue Cheng, Jie Ouyang, Xiaoyu Tao, Qi Liu
- Abstract summary: This paper introduces TextMatch, a novel framework to address image-text discrepancies in text-to-image (T2I) generation and editing.
TextMatch employs a scoring strategy powered by large language models (LLMs) and visual question-answering (VQA) models to evaluate semantic consistency between prompts and generated images.
- Score: 8.591857157392718
- Abstract: Text-to-image generative models excel in creating images from text but struggle with ensuring alignment and consistency between outputs and prompts. This paper introduces TextMatch, a novel framework that leverages multimodal optimization to address image-text discrepancies in text-to-image (T2I) generation and editing. TextMatch employs a scoring strategy powered by large language models (LLMs) and visual question-answering (VQA) models to evaluate semantic consistency between prompts and generated images. By integrating multimodal in-context learning and chain-of-thought reasoning, our method dynamically refines prompts through iterative optimization. This process ensures that the generated images better capture user intent, resulting in higher fidelity and relevance. Extensive experiments demonstrate that TextMatch significantly improves text-image consistency across multiple benchmarks, establishing a reliable framework for advancing the capabilities of text-to-image generative models. Our code is available at https://anonymous.4open.science/r/TextMatch-F55C/.
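The abstract describes an evaluate-then-refine loop: generate an image, score its consistency with the prompt via LLM and VQA feedback, then rewrite the prompt and regenerate until the score is acceptable. The following is a minimal sketch of that loop, not the authors' released implementation; generate_image, vqa_score, and refine_prompt are hypothetical stubs standing in for the T2I model, the LLM/VQA scorer, and the LLM rewriter.

```python
# Minimal sketch of a TextMatch-style evaluate-then-refine loop.
# All three helpers are hypothetical stubs, not the paper's code.

def generate_image(prompt: str) -> str:
    """Stub for a text-to-image model; returns an image handle."""
    return f"image<{prompt}>"

def vqa_score(prompt: str, image: str) -> float:
    """Stub for LLM+VQA scoring of prompt-image consistency in [0, 1]."""
    return 0.5  # placeholder score

def refine_prompt(prompt: str, image: str, score: float) -> str:
    """Stub for an LLM chain-of-thought prompt rewrite using the feedback."""
    return prompt + " (refined)"

def textmatch_loop(prompt: str, threshold: float = 0.9, max_iters: int = 5):
    best = (prompt, None, -1.0)
    for _ in range(max_iters):
        image = generate_image(prompt)
        score = vqa_score(prompt, image)
        if score > best[2]:
            best = (prompt, image, score)  # keep the best-scoring round
        if score >= threshold:
            break  # consistency is good enough, stop refining
        prompt = refine_prompt(prompt, image, score)
    return best

print(textmatch_loop("a red cube on top of a blue sphere"))
```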
Related papers
- One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts.
However, they struggle to support consistent, identity-preserving generation across images for storytelling.
We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval autoregressively and propose an autonomous decision module to choose the best match between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, yielding dynamic fine-control prompts.
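A "dynamic fine-control prompt", in which each word carries an emphasis weight and a time window during which it is injected into the denoising process, can be pictured as a simple data structure. The sketch below is an illustrative guess at such a representation, with hypothetical field names; it is not PAE's actual format.

```python
# Illustrative representation of a dynamic fine-control prompt:
# each token carries an emphasis weight and a denoising-step window.
# Field names are hypothetical, not PAE's actual schema.

from dataclasses import dataclass

@dataclass
class TokenControl:
    word: str
    weight: float      # emphasis applied to the token
    start_step: int    # first denoising step where the token is active
    end_step: int      # last denoising step where the token is active

prompt = [
    TokenControl("castle", weight=1.3, start_step=0, end_step=50),
    TokenControl("sunset", weight=0.8, start_step=10, end_step=40),
]

def active_tokens(tokens: list[TokenControl], step: int) -> list[TokenControl]:
    """Tokens that would be injected at a given denoising step."""
    return [t for t in tokens if t.start_step <= step <= t.end_step]

print([t.word for t in active_tokens(prompt, step=5)])  # ['castle']
```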
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
- Prompt Expansion for Adaptive Text-to-Image Generation [51.67811570987088]
This paper proposes a Prompt Expansion framework that helps users generate high-quality, diverse images with less effort.
The Prompt Expansion model takes a text query as input and outputs a set of expanded text prompts.
We conduct a human evaluation study that shows that images generated through Prompt Expansion are more aesthetically pleasing and diverse than those generated by baseline methods.
arXiv Detail & Related papers (2023-12-27T21:12:21Z)
- Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis [37.32270579534541]
We propose a novel approach for enhancing text-image correspondence by leveraging available semantic layouts.
Our approach achieves higher text-image correspondence than existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets.
arXiv Detail & Related papers (2023-08-16T05:59:33Z)
- What You See is What You Read? Improving Text-Image Alignment Evaluation [28.722369586165108]
We study methods for automatic text-image alignment evaluation.
We first introduce SeeTRUE, a benchmark spanning multiple datasets from both text-to-image and image-to-text generation tasks.
We describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models.
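The first of those two methods, question generation followed by visual question answering, can be summarized by the sketch below: a generic outline of that class of pipeline with stubbed models, not the paper's implementation. Here gen_questions and answer_vqa are hypothetical stand-ins for an LLM question generator and a VQA model.

```python
# Generic sketch of a question-generation + VQA alignment scorer.
# gen_questions and answer_vqa are hypothetical stubs.

def gen_questions(caption: str) -> list[str]:
    """Stub: derive verification questions from the caption's claims."""
    return [f"Does the image show {caption}?"]

def answer_vqa(image: str, question: str) -> float:
    """Stub: VQA model's probability of answering 'yes'."""
    return 0.8  # placeholder

def alignment_score(image: str, caption: str) -> float:
    """Average 'yes' probability over all generated questions."""
    questions = gen_questions(caption)
    return sum(answer_vqa(image, q) for q in questions) / len(questions)

print(alignment_score("img.png", "a dog riding a skateboard"))
```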
arXiv Detail & Related papers (2023-05-17T17:43:38Z)
- TextMatcher: Cross-Attentional Neural Network to Compare Image and Text [0.0]
We devise TextMatcher, the first machine-learning model specifically designed for the image-text comparison problem.
We extensively evaluate the empirical performance of TextMatcher on the popular IAM dataset.
We showcase TextMatcher in a real-world application scenario concerning the automatic processing of bank cheques.
arXiv Detail & Related papers (2022-05-11T14:01:12Z)
- Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer [8.069590683507997]
We propose MXQ-VAE, a vector quantization method for multimodal image-text representation.
MXQ-VAE accepts a paired image and text as input, and learns a joint quantized representation space.
We can use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation.
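The core idea, concatenating image and text features and quantizing them against one shared codebook so that an autoregressive model can operate over a single discrete sequence, can be sketched as follows. All shapes and the codebook here are illustrative assumptions, not MXQ-VAE's actual architecture.

```python
import numpy as np

# Minimal sketch of joint image-text vector quantization: concatenate
# both modalities' features and snap each vector to its nearest
# codebook entry. All shapes/values are illustrative assumptions.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))    # 512 codes, 64-dim each

def quantize(features: np.ndarray) -> np.ndarray:
    """Index of the nearest codebook entry for each feature vector."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

image_feats = rng.normal(size=(16, 64))  # e.g. a 4x4 grid of image patches
text_feats = rng.normal(size=(8, 64))    # e.g. 8 text tokens
joint = np.concatenate([image_feats, text_feats], axis=0)
codes = quantize(joint)                  # one shared discrete sequence
print(codes.shape)                       # (24,)
```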
arXiv Detail & Related papers (2022-04-15T16:29:55Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) a natural-language instruction describing the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)