From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment
- URL: http://arxiv.org/abs/2406.13912v1
- Date: Thu, 20 Jun 2024 01:03:13 GMT
- Title: From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment
- Authors: Yusuke Hirota, Ryo Hachiuma, Chao-Han Huck Yang, Yuta Nakashima,
- Abstract summary: Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text.
We show that enriched captions suffer from increased gender bias and hallucination.
This study serves as a caution against the trend of making descriptive captions more descriptive.
- Score: 26.211648382676856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text. This generative approach to image caption enrichment further makes textual captions more descriptive, improving alignment with the visual context. However, while many studies focus on benefits of generative caption enrichment (GCE), are there any negative side effects? We compare standard-format captions and recent GCE processes from the perspectives of "gender bias" and "hallucination", showing that enriched captions suffer from increased gender bias and hallucination. Furthermore, models trained on these enriched captions amplify gender bias by an average of 30.9% and increase hallucination by 59.5%. This study serves as a caution against the trend of making captions more descriptive.
Related papers
- The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning [89.64905703368255]
We propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning.
Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences.
arXiv Detail & Related papers (2025-03-31T03:00:19Z) - PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training [56.172959986096316]
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs)
HalFscore is a novel metric built upon the language graph and is designed to evaluate both the accuracy and completeness of dense captions at a granular level.
PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations.
arXiv Detail & Related papers (2025-03-09T07:07:03Z) - Improving face generation quality and prompt following with synthetic captions [57.47448046728439]
We introduce a training-free pipeline designed to generate accurate appearance descriptions from images of people.
We then use these synthetic captions to fine-tune a text-to-image diffusion model.
Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces.
arXiv Detail & Related papers (2024-05-17T15:50:53Z) - What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z) - Inserting Faces inside Captions: Image Captioning with Attention Guided Merging [0.0]
We introduce AstroCaptions, a dataset for the image captioning task.
We propose a novel post-processing method to insert identified people's names inside the caption.
arXiv Detail & Related papers (2024-03-20T08:38:25Z) - Mitigating Open-Vocabulary Caption Hallucinations [33.960405731583656]
We propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting.
Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations.
To mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa.
arXiv Detail & Related papers (2023-12-06T17:28:03Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based
Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - Simple Token-Level Confidence Improves Caption Correctness [117.33497608933169]
Token-Level Confidence, or TLC, is a simple yet surprisingly effective method to assess caption correctness.
We fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate token confidences over words or sequences to estimate image-caption consistency.
arXiv Detail & Related papers (2023-05-11T17:58:17Z) - Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z) - Understanding and Evaluating Racial Biases in Image Captioning [18.184279793253634]
We study bias propagation pathways within image captioning, focusing specifically on the COCO dataset.
We demonstrate differences in caption performance, sentiment, and word choice between images of lighter versus darker-skinned people.
arXiv Detail & Related papers (2021-06-16T01:07:24Z) - Structural and Functional Decomposition for Personality Image Captioning
in a Communication Game [53.74847926974122]
Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait.
We introduce a novel formulation for PIC based on a communication game between a speaker and a listener.
arXiv Detail & Related papers (2020-11-17T10:19:27Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - Egoshots, an ego-vision life-logging dataset and semantic fidelity
metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real life images with no captions.
In order to evaluate the quality of the generated captions, we propose a new image captioning metric, object based Semantic Fidelity (SF)
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.