Plausible May Not Be Faithful: Probing Object Hallucination in
Vision-Language Pre-training
- URL: http://arxiv.org/abs/2210.07688v1
- Date: Fri, 14 Oct 2022 10:27:22 GMT
- Title: Plausible May Not Be Faithful: Probing Object Hallucination in
Vision-Language Pre-training
- Authors: Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, Pascale Fung
- Abstract summary: Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale vision-language pre-trained (VLP) models are prone to hallucinate
non-existent visual objects when generating text based on visual information.
In this paper, we exhaustively probe the object hallucination problem from
three aspects. First, we examine various state-of-the-art VLP models, showing
that models achieving better scores on standard metrics (e.g., BLEU-4, CIDEr)
could hallucinate objects more frequently. Second, we investigate how different
types of visual features in VLP influence hallucination, including
region-based, grid-based, and patch-based. Surprisingly, we find that
patch-based features perform the best and smaller patch resolution yields a
non-trivial reduction in object hallucination. Third, we decouple various VLP
objectives and demonstrate their effectiveness in alleviating object
hallucination. Based on that, we propose a new pre-training loss, object masked
language modeling, to further reduce object hallucination. We evaluate models
on both COCO (in-domain) and NoCaps (out-of-domain) datasets with our improved
CHAIR metric. Furthermore, we investigate the effects of various text decoding
strategies and image augmentation methods on object hallucination.
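As background for the evaluation above: the standard CHAIR metric (Rohrbach et al., 2018), on which the paper's improved variant builds, counts hallucinated object mentions at the instance level (CHAIR_i) and at the sentence level (CHAIR_s). The sketch below is a minimal illustration of that standard formulation, assuming object labels have already been extracted from each generated caption and mapped onto the ground-truth vocabulary (e.g., COCO's object categories); it does not reproduce the paper's specific modifications.

# Minimal sketch of the standard CHAIR metric (Rohrbach et al., 2018); the paper
# proposes an improved variant whose exact modifications are not shown here.
# Assumes object labels were already extracted from each caption and mapped
# onto the ground-truth label vocabulary.
def chair_scores(mentioned_objects, gt_objects):
    """mentioned_objects: list of sets, objects mentioned in each generated caption
    gt_objects: list of sets, objects actually present in each image"""
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, present in zip(mentioned_objects, gt_objects):
        hallucinated = mentioned - present             # mentioned but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            hallucinated_captions += 1
    chair_i = hallucinated_mentions / max(total_mentions, 1)         # instance-level rate
    chair_s = hallucinated_captions / max(len(mentioned_objects), 1) # sentence-level rate
    return chair_i, chair_s

Lower values on both variants indicate fewer hallucinated objects.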
Related papers
- Multi-Object Hallucination in Vision-Language Models (arXiv, 2024-07-08)
  Large vision-language models (LVLMs) often suffer from object hallucination.
  Hallucinatory behaviors are influenced by data-specific factors (salience and frequency) and by intrinsic model behaviors.
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models (arXiv, 2024-06-24)
  This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
  VideoHallucer categorizes hallucinations into two main types, intrinsic and extrinsic, with further subcategories for detailed analysis.
- AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models (arXiv, 2024-06-16)
  Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning about abnormal or hypothetical objects.
  The authors develop AUTOHALLUSION, the first automatic benchmark-generation approach, which uses a few principal strategies to create diverse examples.
  It generates image-based questions whose ground-truth answers contradict the language module's prior.
  A model has to overcome contextual biases and distractions to reach the correct answers, while incorrect or inconsistent answers indicate hallucination.
- Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback (arXiv, 2024-04-22)
  Proposes detecting and mitigating hallucinations in large vision-language models (LVLMs) via fine-grained AI feedback.
  A small hallucination annotation dataset is generated with proprietary models, and a detect-then-rewrite pipeline then automatically constructs a preference dataset for training a hallucination-mitigating model.
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model (arXiv, 2023-12-12)
  Multimodal large language models (MLLMs) can efficiently integrate natural language with visual information to handle multimodal tasks.
  However, MLLMs still face a fundamental limitation: hallucinations, where they tend to generate erroneous or fabricated information.
  This paper addresses hallucinations in MLLMs from the novel perspective of representation learning.
- Mitigating Hallucination in Visual Language Models with Visual Supervision (arXiv, 2023-11-27)
  Large vision-language models (LVLMs) suffer heavily from hallucination; a key problem is their weak ability to comprehend detailed content in a multimodal context.
  This paper brings more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs.
- Analyzing and Mitigating Object Hallucination in Large Vision-Language Models (arXiv, 2023-10-01)
  Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages, yet they still suffer from object hallucination: generating descriptions that include objects that do not actually exist in the images.
  The authors propose LVLM Hallucination Revisor (LURE), an algorithm that rectifies object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
- Evaluating Object Hallucination in Large Vision-Language Models (arXiv, 2023-05-17)
  This work presents the first systematic study of object hallucination in large vision-language models (LVLMs), finding that LVLMs tend to generate objects in their descriptions that are inconsistent with the target images.
  It proposes POPE, a polling-based query method for evaluating object hallucination; a minimal sketch of this polling idea appears after this list.
- Thinking Hallucination for Video Captioning (arXiv, 2022-09-28)
  In video captioning, there are two kinds of hallucination: object and action hallucination.
  Three main factors are identified: (i) inadequate visual features extracted from pre-trained models, (ii) improper influence of source and target contexts during multimodal fusion, and (iii) exposure bias in the training strategy.
  The method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and Microsoft Research Video Description Corpus (MSVD) datasets.
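For concreteness, the polling-based evaluation referenced in the POPE entry above can be illustrated as follows. This is a minimal sketch, assuming ground-truth object annotations per image; the prompt wording, the number of sampled objects, and the use of purely random negatives are simplifying assumptions rather than the benchmark's exact protocol (POPE also uses popular and adversarial negative sampling).

import random

# Minimal sketch of polling-based object-hallucination evaluation in the spirit
# of POPE: ask binary "Is there a <object> in the image?" questions and score
# the model's yes/no answers. Negative objects are sampled at random here; the
# actual benchmark also uses popular and adversarial negative sampling.
def build_queries(gt_objects, vocabulary, num_per_image=3, seed=0):
    """gt_objects: dict image_id -> set of objects present in the image
    vocabulary: list of all candidate object labels"""
    rng = random.Random(seed)
    queries = []  # (image_id, question, expected "yes"/"no")
    for image_id, present in gt_objects.items():
        for obj in sorted(present)[:num_per_image]:                      # positive probes
            queries.append((image_id, f"Is there a {obj} in the image?", "yes"))
        absent = [o for o in vocabulary if o not in present]
        for obj in rng.sample(absent, min(num_per_image, len(absent))):  # negative probes
            queries.append((image_id, f"Is there a {obj} in the image?", "no"))
    return queries

def polling_accuracy(queries, answers):
    """answers: dict (image_id, question) -> the model's free-form reply"""
    correct = sum(answers.get((img, q), "").strip().lower().startswith(expected)
                  for img, q, expected in queries)
    return correct / max(len(queries), 1)

The POPE paper reports accuracy, precision, recall, and F1 over such yes/no answers.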
This list is automatically generated from the titles and abstracts of the papers on this site.