Related papers: Exposing Text-Image Inconsistency Using Diffusion Models

Exposing Text-Image Inconsistency Using Diffusion Models

URL: http://arxiv.org/abs/2404.18033v1
Date: Sun, 28 Apr 2024 00:29:24 GMT
Title: Exposing Text-Image Inconsistency Using Diffusion Models
Authors: Mingzhen Huang, Shan Jia, Zhou Zhou, Yan Ju, Jialing Cai, Siwei Lyu,
Abstract summary: A growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. This study introduces D-TIIL, which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency.
Score: 36.820267498751626
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more nuanced, human evaluation is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIIL (Diffusion-based Text-Image Inconsistency Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. These models, trained on large-scale datasets act as ``omniscient" agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIIL uses text embeddings and modified image regions to visualize these inconsistencies. To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. Unlike existing datasets, TIIL enables assessment at the level of individual words and image regions and is carefully designed to represent various inconsistencies. D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency, providing a robust framework for future research combating misinformation.

Related papers

TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images. We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z)
DECOR:Decomposition and Projection of Text Embeddings for Text-to-Image Customization [15.920735314050296]
This study decomposes the text embedding matrix and conducts a component analysis to understand the embedding space geometry. We propose DECOR, which projects text embeddings onto a vector space to undesired token vectors. Experimental results demonstrate that DECOR outperforms state-of-the-art customization models.
arXiv Detail & Related papers (2024-12-12T10:59:44Z)
Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models. A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies. Our framework does not rely on human-annotated captions reference, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
The Right Losses for the Right Gains: Improving the Semantic Consistency of Deep Text-to-Image Generation with Distribution-Sensitive Losses [0.35898124827270983]
We propose a contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss and fake-to-real loss. We test this approach on two baseline models: SSAGAN and AttnGAN. Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset.
arXiv Detail & Related papers (2023-12-18T00:05:28Z)
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment [64.49170817854942]
We present a method to provide detailed explanation of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible captions for a given image. We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations.
arXiv Detail & Related papers (2023-12-05T20:07:34Z)
Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
On Manipulating Scene Text in the Wild with Diffusion Models [4.034781390227754]
We introduce Diffusion-BasEd Scene Text manipulation Network so-called DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. Our method achieves 94.15% and 98.12% on datasets for character-level evaluation.
arXiv Detail & Related papers (2023-11-01T11:31:50Z)
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story. Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities. We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media [93.51739200834837]
We propose a dataset where both image and text are unmanipulated but mismatched. We introduce several strategies for automatic retrieval of suitable images for the given captions. Our large-scale automatically generated NewsCLIPpings dataset requires models to jointly analyze both modalities.
arXiv Detail & Related papers (2021-04-13T01:53:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.