Mitigating Open-Vocabulary Caption Hallucinations
- URL: http://arxiv.org/abs/2312.03631v4
- Date: Wed, 16 Oct 2024 19:35:55 GMT
- Title: Mitigating Open-Vocabulary Caption Hallucinations
- Authors: Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor
- Abstract summary: We propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting.
Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations.
To mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa.
- Score: 33.960405731583656
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. Code and models can be found at: https://github.com/assafbk/mocha_code
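The abstract describes MOCHa's key ingredient: a multi-objective reward that balances fidelity (the caption asserts nothing spurious) against adequacy (the caption still covers the image content), optimized with reinforcement learning. The sketch below is a minimal illustration of such a scalar reward, not the paper's exact formulation; the scorer interfaces, the linear weighting, and the `alpha` parameter are illustrative assumptions (in practice the two objectives might be scored by, e.g., an NLI-style entailment model and a learned text-similarity metric).

```python
from typing import Callable

# Hypothetical scorer interfaces: each maps (caption, reference_text) -> score in [0, 1].
# The choice of scorers here is an assumption for illustration, not the paper's reward.
FidelityScorer = Callable[[str, str], float]
AdequacyScorer = Callable[[str, str], float]


def caption_reward(caption: str,
                   reference: str,
                   fidelity: FidelityScorer,
                   adequacy: AdequacyScorer,
                   alpha: float = 0.5) -> float:
    """Combine fidelity and adequacy into one scalar reward for RL fine-tuning.

    alpha controls the trade-off: higher alpha penalizes spurious details more,
    lower alpha favors descriptive coverage.
    """
    f = fidelity(caption, reference)  # high when the caption asserts nothing spurious
    a = adequacy(caption, reference)  # high when the caption covers the image content
    return alpha * f + (1.0 - alpha) * a


# Toy usage with trivial word-overlap stand-ins for the two scorers.
if __name__ == "__main__":
    def toy_fidelity(cap: str, ref: str) -> float:
        cap_w, ref_w = set(cap.lower().split()), set(ref.lower().split())
        return len(cap_w & ref_w) / max(len(cap_w), 1)

    def toy_adequacy(cap: str, ref: str) -> float:
        cap_w, ref_w = set(cap.lower().split()), set(ref.lower().split())
        return len(cap_w & ref_w) / max(len(ref_w), 1)

    r = caption_reward("a dog on a beach", "a dog runs on a sandy beach",
                       toy_fidelity, toy_adequacy, alpha=0.6)
    print(f"reward = {r:.3f}")
```

In an RL fine-tuning loop, this scalar would be computed for captions sampled from the model and maximized with a policy-gradient method, typically alongside regularization that keeps the fine-tuned captioner close to its initialization.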
Related papers
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations.
In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
- PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training [56.172959986096316]
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs).
HalFscore is a novel metric built upon the language graph and is designed to evaluate both the accuracy and completeness of dense captions at a granular level.
PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations.
arXiv Detail & Related papers (2025-03-09T07:07:03Z)
- Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow [32.039946174953236]
Large vision-language models show tremendous potential in understanding visual information through human languages.
They are prone to object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image.
We propose Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing hallucination noise.
arXiv Detail & Related papers (2025-02-28T05:56:23Z)
- Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models [24.241691571850403]
Large Vision-Language Models (LVLMs) integrate image encoders with Large Language Models (LLMs) to process multi-modal inputs and perform complex visual tasks.
They often generate hallucinations by describing non-existent objects or attributes, compromising their reliability.
This study analyzes hallucination patterns in image captioning, showing that not all tokens in the generation process are influenced by image input.
arXiv Detail & Related papers (2025-02-24T05:00:52Z)
- Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input.
We introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z)
- VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding [38.23310445372371]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning.
They often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination.
Recent approaches have introduced training-free methods to mitigate hallucinations by adjusting the decoding strategy during the inference stage.
We propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD).
arXiv Detail & Related papers (2024-11-24T13:42:02Z)
- HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding [36.360171373963716]
Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks.
These models still suffer from multimodal hallucination, i.e., the generation of objects or content that contradicts the images.
We propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD) to address this issue.
arXiv Detail & Related papers (2024-09-30T15:52:05Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities (a minimal sketch of such visual-similarity retrieval appears after this list).
We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [48.065569871444275]
We propose detecting and mitigating hallucinations in Large Vision Language Models (LVLMs) via fine-grained AI feedback.
We generate a small-scale hallucination annotation dataset using proprietary models.
Then, we propose a detect-then-rewrite pipeline to automatically construct a preference dataset for training a hallucination-mitigating model.
arXiv Detail & Related papers (2024-04-22T14:46:10Z)
- ALOHa: A New Measure for Hallucination in Captioning Models [61.007542765171586]
The existing metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms (a minimal sketch of this closed-set style of counting appears after this list).
We propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations.
We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations.
arXiv Detail & Related papers (2024-04-03T17:59:36Z)
- ESREAL: Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models [6.014286500397164]
Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions.
We introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens.
Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46%, respectively, on the CHAIR metric.
arXiv Detail & Related papers (2024-03-24T14:21:06Z)
- EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models [27.679307570206937]
We propose an efficient fine-grained unlearning framework (EFUF) to eliminate hallucinations without paired data.
Our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead.
arXiv Detail & Related papers (2024-02-15T08:58:03Z)
- CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z)
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning [12.354076490479516]
Explaining an image with missing or non-existent objects is known as object bias (hallucination) in image captioning.
This behaviour is quite common in state-of-the-art captioning models and is undesirable to humans.
We propose three simple yet efficient training augmentation methods for sentences, which require no new training data and no increase in model size.
arXiv Detail & Related papers (2021-10-04T20:25:22Z)
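Several entries above report results on the CHAIR metric, whose closed vocabulary is the limitation that OpenCHAIR and ALOHa address. As referenced in the ALOHa entry, the sketch below illustrates that closed-set style of counting: an object mention counts as hallucinated if it is absent from the image's annotated object set. The object-extraction and synonym-mapping steps of the real metric are omitted, and all names are illustrative.

```python
from typing import List, Set, Tuple


def hallucination_rates(caption_objects: List[Set[str]],
                        gt_objects: List[Set[str]]) -> Tuple[float, float]:
    """CHAIR-style counts: object-level and caption-level hallucination rates.

    caption_objects[i]: objects mentioned in caption i (already extracted and normalized)
    gt_objects[i]:      objects annotated for image i
    """
    mentioned = hallucinated = captions_with_hall = 0
    for mentioned_objs, gt in zip(caption_objects, gt_objects):
        hall = {obj for obj in mentioned_objs if obj not in gt}
        mentioned += len(mentioned_objs)
        hallucinated += len(hall)
        captions_with_hall += int(bool(hall))
    chair_i = hallucinated / max(mentioned, 1)                   # per-object rate
    chair_s = captions_with_hall / max(len(caption_objects), 1)  # per-caption rate
    return chair_i, chair_s


# Toy usage: the first caption mentions a "surfboard" that is not annotated for its image.
if __name__ == "__main__":
    caps = [{"dog", "beach", "surfboard"}, {"cat", "sofa"}]
    gts = [{"dog", "beach", "person"}, {"cat", "sofa", "lamp"}]
    print(hallucination_rates(caps, gts))  # (0.2, 0.5)
```

OpenCHAIR and ALOHa keep this counting idea but, per their abstracts above, replace the fixed MS COCO object list with open-vocabulary matching based on generative models or LLMs.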
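The two retrieval-augmented captioning entries above condition generation on captions retrieved from an external corpus by visual similarity. As referenced in the first of those entries, the sketch below shows only this generic retrieval step, assuming precomputed, L2-normalized image embeddings; the encoders, the kNN-augmented attention, and the rest of those architectures are not reproduced here, and all names are illustrative.

```python
import numpy as np


def retrieve_captions(query_embedding: np.ndarray,
                      corpus_embeddings: np.ndarray,
                      corpus_captions: list,
                      k: int = 5) -> list:
    """Return captions of the k most visually similar corpus images.

    Assumes all embeddings are L2-normalized, so a dot product equals cosine similarity.
    query_embedding:   shape (d,)
    corpus_embeddings: shape (n, d), one row per corpus image
    corpus_captions:   n caption strings aligned with corpus_embeddings
    """
    similarities = corpus_embeddings @ query_embedding  # (n,) cosine scores
    top_k = np.argsort(-similarities)[:k]                # indices of the best matches
    return [corpus_captions[i] for i in top_k]


# Toy usage with random "embeddings"; a real system would use an image encoder and,
# for large corpora, an approximate-nearest-neighbor index.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(100, 16))
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
    captions = [f"caption {i}" for i in range(100)]
    query = corpus[3] + 0.01 * rng.normal(size=16)
    query /= np.linalg.norm(query)
    print(retrieve_captions(query, corpus, captions, k=3))  # likely includes "caption 3"
```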
This list is automatically generated from the titles and abstracts of the papers on this site.