TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization
- URL: http://arxiv.org/abs/2412.18185v3
- Date: Sat, 25 Jan 2025 02:19:33 GMT
- Title: TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization
- Authors: Yucong Luo, Mingyue Cheng, Jie Ouyang, Xiaoyu Tao, Qi Liu,
- Abstract summary: This paper introduces TextMatch, a novel framework to address image-text discrepancies in text-to-image (T2I) generation and editing. TextMatch employs a scoring strategy powered by large language models (LLMs) and visual question-answering (VQA) models to evaluate semantic consistency between prompts and generated images.
- Score: 8.591857157392718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image generative models excel in creating images from text but struggle with ensuring alignment and consistency between outputs and prompts. This paper introduces TextMatch, a novel framework that leverages multimodal optimization to address image-text discrepancies in text-to-image (T2I) generation and editing. TextMatch employs a scoring strategy powered by large language models (LLMs) and visual question-answering (VQA) models to evaluate semantic consistency between prompts and generated images. By integrating multimodal in-context learning and chain of thought reasoning, our method dynamically refines prompts through iterative optimization. This process ensures that the generated images better capture user intent, resulting in higher fidelity and relevance. Extensive experiments demonstrate that TextMatch significantly improves text-image consistency across multiple benchmarks, establishing a reliable framework for advancing the capabilities of text-to-image generative models. Our code is available at https://anonymous.4open.science/r/TextMatch-F55C/.
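A minimal sketch of the evaluate-and-refine loop the abstract describes, written against hypothetical callables: generate_image, generate_questions, answer, and refine_prompt are illustrative names and signatures, not the interfaces in the released code.

```python
from typing import Callable, List, Tuple

def textmatch_loop(
    prompt: str,
    generate_image: Callable[[str], object],                      # T2I model: prompt -> image
    generate_questions: Callable[[str], List[str]],               # LLM: prompt -> yes/no probe questions
    answer: Callable[[object, str], str],                         # VQA model: (image, question) -> answer
    refine_prompt: Callable[[str, List[Tuple[str, str]]], str],   # LLM: (prompt, failed QA pairs) -> new prompt
    threshold: float = 0.9,
    max_iters: int = 5,
) -> Tuple[object, str, float]:
    """Regenerate the image and refine the prompt until enough VQA probes pass."""
    questions = generate_questions(prompt)
    image, score = None, 0.0
    for _ in range(max_iters):
        image = generate_image(prompt)
        qa = [(q, answer(image, q)) for q in questions]
        # Consistency score: fraction of probes the VQA model answers affirmatively.
        score = sum(a.strip().lower().startswith("yes") for _, a in qa) / max(len(qa), 1)
        if score >= threshold:
            break
        # Feed the failed checks back to the LLM as in-context evidence for rewriting the prompt.
        failed = [(q, a) for q, a in qa if not a.strip().lower().startswith("yes")]
        prompt = refine_prompt(prompt, failed)
    return image, prompt, score
```

In this sketch the VQA pass rate stands in for the paper's LLM/VQA scoring strategy, and the failed question-answer pairs play the role of the in-context evidence used to rewrite the prompt on the next iteration.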
Related papers
- Instruction-augmented Multimodal Alignment for Image-Text and Element Matching [8.470864568439968]
This paper presents an improved evaluation method named Instruction-augmented Multimodal Alignment for Image-Text and Element Matching (iMatch)
iMatch evaluates image-text semantic alignment by fine-tuning multimodal large language models.
Our experimental results show that the iMatch method significantly surpasses existing methods, confirming its effectiveness and practical value.
arXiv Detail & Related papers (2025-04-16T12:21:49Z) - TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images.
We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z) - Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
We propose Dream Engine, an efficient framework designed for arbitrary text-image interleaved control in image generation models.
Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning.
Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark.
arXiv Detail & Related papers (2025-02-27T15:08:39Z) - One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts.
They struggle to consistently preserve subject identity across generated images, a requirement for storytelling.
We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z) - Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z) - Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation framework and propose an autonomous decision module to choose the best match between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z) - FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z) - Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, yielding dynamically fine-controlled prompts.
arXiv Detail & Related papers (2024-04-05T13:44:39Z) - Prompt Expansion for Adaptive Text-to-Image Generation [51.67811570987088]
This paper proposes a Prompt Expansion framework that helps users generate high-quality, diverse images with less effort.
The Prompt Expansion model takes a text query as input and outputs a set of expanded text prompts.
We conduct a human evaluation study that shows that images generated through Prompt Expansion are more aesthetically pleasing and diverse than those generated by baseline methods.
arXiv Detail & Related papers (2023-12-27T21:12:21Z) - Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis [37.32270579534541]
We propose a novel approach for enhancing text-image correspondence by leveraging available semantic layouts.
Our approach achieves higher text-image correspondence than existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets.
arXiv Detail & Related papers (2023-08-16T05:59:33Z) - What You See is What You Read? Improving Text-Image Alignment Evaluation [28.722369586165108]
We study methods for automatic text-image alignment evaluation.
We first introduce SeeTRUE, an evaluation suite spanning multiple datasets from both text-to-image and image-to-text generation tasks.
We describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models.
arXiv Detail & Related papers (2023-05-17T17:43:38Z) - GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also achieves significant improvements over a recent diffusion model.
arXiv Detail & Related papers (2023-04-25T02:14:44Z) - Do DALL-E and Flamingo Understand Each Other? [36.4732744974398]
We propose a reconstruction task where Flamingo generates a description for a given image and DALL-E uses this description as input to synthesize a new image.
We find that an optimal description of an image is one that gives rise to a generated image similar to the original one.
We propose a unified framework to finetune the text-to-image and image-to-text models.
arXiv Detail & Related papers (2022-12-23T10:46:56Z) - TextMatcher: Cross-Attentional Neural Network to Compare Image and Text [0.0]
We devise TextMatcher, the first machine-learning model specifically designed for this image-text comparison problem.
We extensively evaluate the empirical performance of TextMatcher on the popular IAM dataset.
We showcase TextMatcher in a real-world application scenario concerning the automatic processing of bank cheques.
arXiv Detail & Related papers (2022-05-11T14:01:12Z) - Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer [8.069590683507997]
We propose MXQ-VAE, a vector quantization method for multimodal image-text representation.
MXQ-VAE accepts a paired image and text as input, and learns a joint quantized representation space.
We can use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation.
arXiv Detail & Related papers (2022-04-15T16:29:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.