Let there be a clock on the beach: Reducing Object Hallucination in
Image Captioning
- URL: http://arxiv.org/abs/2110.01705v1
- Date: Mon, 4 Oct 2021 20:25:22 GMT
- Title: Let there be a clock on the beach: Reducing Object Hallucination in
Image Captioning
- Authors: Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas
- Abstract summary: Describing an image with objects that are missing or do not exist in it is known as object bias (hallucination) in image captioning.
This behaviour is quite common in state-of-the-art captioning models and is undesirable to humans.
We propose three simple yet efficient training augmentation methods for sentences, which require no new training data or increase in model size.
- Score: 12.354076490479516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Describing an image with objects that are missing or do not exist
in it is known as object bias (hallucination) in image captioning. This
behaviour is quite common in state-of-the-art captioning models and is
undesirable to humans. To decrease object hallucination in captioning, we
propose three simple yet efficient training augmentation methods for
sentences, which require no new training data or increase in model size.
Through extensive analysis, we show that the proposed methods can
significantly diminish our models' object bias on hallucination metrics.
Moreover, we experimentally demonstrate that our methods decrease the
dependency on visual features. All of our code, configuration files and model
weights will be made public.
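The abstract does not spell out the three augmentation methods, so the snippet below is only a minimal, hypothetical sketch of the general idea of sentence-level augmentation for captioning: object words in training captions are occasionally swapped for other dataset objects so the decoder cannot rely purely on co-occurrence priors. This is not the authors' actual procedure; the toy vocabulary, swap probability, and function name are illustrative assumptions.

```python
import random

# Hypothetical co-occurrence-breaking caption augmentation.
# NOTE: this is NOT the paper's method (the abstract does not detail the
# three schemes); it only illustrates training-time sentence augmentation
# that needs no new data and adds no model parameters.

TOY_OBJECT_VOCAB = ["clock", "beach", "surfboard", "umbrella", "dog", "kite"]

def augment_caption(tokens, object_vocab=TOY_OBJECT_VOCAB, swap_prob=0.3, rng=random):
    """Randomly replace object words in a caption with other objects
    from the vocabulary, yielding an extra training sentence."""
    augmented = []
    for tok in tokens:
        if tok in object_vocab and rng.random() < swap_prob:
            # Swap the object for a different one, keeping sentence structure.
            augmented.append(rng.choice([o for o in object_vocab if o != tok]))
        else:
            augmented.append(tok)
    return augmented

# Example: object words may be swapped, discouraging the decoder from
# guessing objects that merely tend to co-occur with the scene.
print(augment_caption("a clock on the beach".split()))
```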
Related papers
- PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training [56.172959986096316]
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs).
HalFscore is a novel metric built upon the language graph and is designed to evaluate both the accuracy and completeness of dense captions at a granular level.
PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations.
arXiv Detail & Related papers (2025-03-09T07:07:03Z)
- Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow [32.039946174953236]
Large vision-language models show tremendous potential in understanding visual information through human languages.
They are prone to object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image.
We propose Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing hallucination noise.
arXiv Detail & Related papers (2025-02-28T05:56:23Z)
- Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models [22.42712853647949]
We present an in-depth investigation into the object hallucination problem specifically within the CLIP model.
We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities.
We show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
arXiv Detail & Related papers (2024-10-04T06:24:49Z)
- See or Guess: Counterfactually Regularized Image Captioning [32.82695612178604]
We present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable.
Our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models.
arXiv Detail & Related papers (2024-08-29T17:59:57Z)
- Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? [53.89380284760555]
Large vision-language models (LVLMs) produce captions that mention concepts that cannot be found in the image.
These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption.
Recent work suggests that the addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination.
arXiv Detail & Related papers (2024-06-20T16:56:11Z)
- ALOHa: A New Measure for Hallucination in Captioning Models [61.007542765171586]
The existing metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms (a minimal sketch of a CHAIR-style computation appears after this list).
We propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations.
We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations.
arXiv Detail & Related papers (2024-04-03T17:59:36Z)
- Quantity Matters: Towards Assessing and Mitigating Number Hallucination in Large Vision-Language Models [57.42800112251644]
We focus on a specific type of hallucination: number hallucination, i.e., models incorrectly identifying the number of certain objects in pictures.
We devise a training approach aimed at improving consistency to reduce number hallucinations, which leads to an 8% enhancement in performance over direct finetuning methods.
arXiv Detail & Related papers (2024-03-03T02:31:11Z)
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)
- Mitigating Open-Vocabulary Caption Hallucinations [33.960405731583656]
We propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting.
Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations.
To mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa.
arXiv Detail & Related papers (2023-12-06T17:28:03Z)
- Reducing Hallucinations in Neural Machine Translation with Feature Attribution [54.46113444757899]
We present a case study focusing on model understanding and regularisation to reduce hallucinations in NMT.
We first use feature attribution methods to study the behaviour of an NMT model that produces hallucinations.
We then leverage these methods to propose a novel loss function that substantially helps reduce hallucinations and does not require retraining the model from scratch.
arXiv Detail & Related papers (2022-11-17T20:33:56Z)
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)
- Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
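For context on the CHAIR metric referenced in the ALOHa entry above: under the commonly used definition (Rohrbach et al., 2018), an object mention counts as hallucinated if it maps to an MS COCO category that is absent from the image's ground-truth annotations. The sketch below is a simplified illustration of that computation; the synonym-to-category mapping is omitted and the input format is an assumption.

```python
def chair_scores(captions_objects, gt_objects):
    """Compute simplified CHAIR-style hallucination rates.

    captions_objects: list of sets, the COCO objects mentioned in each caption
                      (after synonym mapping, which is omitted here).
    gt_objects:       list of sets, the ground-truth objects for each image.
    Returns (CHAIR_i, CHAIR_s):
      CHAIR_i = hallucinated object mentions / all object mentions
      CHAIR_s = captions with at least one hallucinated object / all captions
    """
    mentions = hallucinated = bad_captions = 0
    for mentioned, truth in zip(captions_objects, gt_objects):
        halluc = mentioned - truth          # mentioned but not in the image
        mentions += len(mentioned)
        hallucinated += len(halluc)
        bad_captions += bool(halluc)
    chair_i = hallucinated / max(mentions, 1)
    chair_s = bad_captions / max(len(captions_objects), 1)
    return chair_i, chair_s

# Toy example: one caption mentions a "clock" that is not in the image,
# giving CHAIR_i = 0.5 and CHAIR_s = 1.0.
print(chair_scores([{"clock", "beach"}], [{"beach", "umbrella"}]))
```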
This list is automatically generated from the titles and abstracts of the papers on this site.