Is Your Text-to-Image Model Robust to Caption Noise?
- URL: http://arxiv.org/abs/2412.19531v1
- Date: Fri, 27 Dec 2024 08:53:37 GMT
- Title: Is Your Text-to-Image Model Robust to Caption Noise?
- Authors: Weichen Yu, Ziyan Yang, Shanchuan Lin, Qi Zhao, Jianyi Wang, Liangke Gui, Matt Fredrikson, Lu Jiang
- Abstract summary: In text-to-image (T2I) generation, a prevalent training technique involves utilizing Vision Language Models (VLMs) for image re-captioning.
Even though VLMs are known to exhibit hallucination, generating descriptive content that deviates from the visual reality, the ramifications of such caption hallucinations on T2I generation performance remain under-explored.
- Score: 38.19377765665836
- Abstract: In text-to-image (T2I) generation, a prevalent training technique involves utilizing Vision Language Models (VLMs) for image re-captioning. Even though VLMs are known to exhibit hallucination, generating descriptive content that deviates from the visual reality, the ramifications of such caption hallucinations on T2I generation performance remain under-explored. Through our empirical investigation, we first establish a comprehensive dataset comprising VLM-generated captions, and then systematically analyze how caption hallucination influences generation outcomes. Our findings reveal that (1) disparities in caption quality persistently impact model outputs during fine-tuning; (2) VLMs' confidence scores serve as reliable indicators for detecting and characterizing noise-related patterns in the data distribution; and (3) even subtle variations in caption fidelity have significant effects on the quality of learned representations. These findings collectively emphasize the profound impact of caption quality on model performance and highlight the need for more sophisticated, robust training algorithms in T2I. In response to these observations, we propose an approach that leverages VLM confidence scores to mitigate caption noise, thereby enhancing the robustness of T2I models against caption hallucination.
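The abstract describes the confidence-based mitigation only at a high level. As a rough sketch, the snippet below shows one plausible way a per-caption VLM confidence score (assumed here to be the length-normalized mean token log-probability reported by the captioning VLM) could be used to drop or down-weight suspect captions before T2I fine-tuning; the `CaptionedImage` structure, the `drop_below` threshold, and the confidence-as-loss-weight rule are illustrative assumptions, not the authors' actual algorithm.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CaptionedImage:
    image_path: str
    caption: str
    token_logprobs: List[float]  # per-token log-probabilities from the captioning VLM

def caption_confidence(sample: CaptionedImage) -> float:
    """Length-normalized confidence: exp of the mean token log-probability."""
    if not sample.token_logprobs:
        return 0.0
    return math.exp(sum(sample.token_logprobs) / len(sample.token_logprobs))

def weight_by_confidence(samples: List[CaptionedImage],
                         drop_below: float = 0.3) -> List[Tuple[CaptionedImage, float]]:
    """Drop captions whose confidence falls below `drop_below` (likely hallucinated)
    and assign the remaining samples a loss weight equal to their confidence."""
    weighted = []
    for s in samples:
        c = caption_confidence(s)
        if c < drop_below:
            continue  # treat as caption noise: exclude from T2I fine-tuning
        weighted.append((s, c))
    return weighted

# Toy example: a confidently generated caption vs. a low-confidence (possibly hallucinated) one.
data = [
    CaptionedImage("img_001.jpg", "a red bicycle leaning against a brick wall",
                   [-0.10, -0.20, -0.15, -0.10, -0.30]),
    CaptionedImage("img_002.jpg", "a dog riding a skateboard at night",
                   [-1.80, -2.10, -1.50, -2.40, -1.90]),
]
for sample, w in weight_by_confidence(data):
    print(f"{sample.image_path}: loss weight {w:.2f}")
```

In a real fine-tuning loop the returned weight would scale each sample's loss, so low-confidence captions contribute proportionally less to the learned text-image mapping rather than being trusted equally.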
Related papers
- Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model [0.0]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content.
These models frequently exhibit hallucination behavior, where they generate descriptions containing objects or details absent in the input image.
We propose a novel attention modification approach that combines selective token emphasis and head-specific modulation to maintain visual grounding.
arXiv Detail & Related papers (2025-01-21T15:22:31Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions [31.637204677787576]
We introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model's existing knowledge and visual understanding.
KnowAda minimizes hallucinations while preserving high descriptiveness.
Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations.
arXiv Detail & Related papers (2024-11-13T20:50:04Z)
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- Generating Faithful and Salient Text from Multimodal Data [24.866158772311522]
We develop a framework to generate faithful and salient text from mixed-modal data.
We train a small vision critic model to identify hallucinated and non-salient features from the image modality.
Experiments on two datasets show that our framework improves LMMs' generation quality on both faithfulness and saliency.
arXiv Detail & Related papers (2024-09-06T00:59:10Z)
- ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis [6.066100464517522]
We introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations.
Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights.
It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR.
arXiv Detail & Related papers (2024-04-15T21:19:10Z)
- Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution [49.762034744605955]
We propose a multi-modal information bottleneck approach to improve interpretability of vision-language models.
We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models.
arXiv Detail & Related papers (2023-12-28T18:02:22Z)
- Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study demonstrates the effectiveness of DeFo in improving vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z)
- Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme for training the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.