Plausible May Not Be Faithful: Probing Object Hallucination in
Vision-Language Pre-training
- URL: http://arxiv.org/abs/2210.07688v1
- Date: Fri, 14 Oct 2022 10:27:22 GMT
- Title: Plausible May Not Be Faithful: Probing Object Hallucination in
Vision-Language Pre-training
- Authors: Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, Pascale Fung
- Abstract summary: Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
- Score: 66.0036211069513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale vision-language pre-trained (VLP) models are prone to hallucinate
non-existent visual objects when generating text based on visual information.
In this paper, we exhaustively probe the object hallucination problem from
three aspects. First, we examine various state-of-the-art VLP models, showing
that models achieving better scores on standard metrics(e.g., BLEU-4, CIDEr)
could hallucinate objects more frequently. Second, we investigate how different
types of visual features in VLP influence hallucination, including
region-based, grid-based, and patch-based. Surprisingly, we find that
patch-based features perform the best and smaller patch resolution yields a
non-trivial reduction in object hallucination. Third, we decouple various VLP
objectives and demonstrate their effectiveness in alleviating object
hallucination. Based on that, we propose a new pre-training loss, object masked
language modeling, to further reduce object hallucination. We evaluate models
on both COCO (in-domain) and NoCaps (out-of-domain) datasets with our improved
CHAIR metric. Furthermore, we investigate the effects of various text decoding
strategies and image augmentation methods on object hallucination.
Related papers
- From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models [15.401221354325672]
Hallucinations in large vision models (LVLMs) are a significant challenge, i.e., generating objects that are not presented in the visual input.
Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to extract or decouple visual features.
In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling)
arXiv Detail & Related papers (2024-10-09T11:46:32Z) - Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models [22.42712853647949]
We present an in-depth investigation into the object hallucination problem specifically within the CLIP model.
We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities.
We show the the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
arXiv Detail & Related papers (2024-10-04T06:24:49Z) - HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding [36.360171373963716]
Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks.
These models still suffer from multimodal hallucination, which means the generation of objects or content that violates the images.
We propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD) to address this issue.
arXiv Detail & Related papers (2024-09-30T15:52:05Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs)
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [48.065569871444275]
We propose detecting and mitigating hallucinations in Large Vision Language Models (LVLMs) via fine-grained AI feedback.
We generate a small-size hallucination annotation dataset by proprietary models.
Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for training hallucination mitigating model.
arXiv Detail & Related papers (2024-04-22T14:46:10Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - Hallucination Augmented Contrastive Learning for Multimodal Large
Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z) - Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs)
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z) - Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.