Related papers: Multi-Object Hallucination in Vision-Language Models

Multi-Object Hallucination in Vision-Language Models

URL: http://arxiv.org/abs/2407.06192v1
Date: Mon, 8 Jul 2024 17:59:57 GMT
Title: Multi-Object Hallucination in Vision-Language Models
Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai,
Abstract summary: Large vision language models (LVLMs) often suffer from object hallucination. Hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and model intrinsic behaviors.
Score: 28.135215173793785
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large vision language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. With comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we found that (1) LVLMs suffer more hallucinations when focusing on multiple objects compared to a single object. (2) The tested object class distribution affects hallucination behaviors, indicating that LVLMs may follow shortcuts and spurious correlations.(3) Hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and model intrinsic behaviors. We hope to enable LVLMs to recognize and reason about multiple objects that often occur in realistic visual scenes, provide insights, and quantify our progress towards mitigating the issues.

Related papers

MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models [73.20126092411776]
We conduct the first systematic study of hallucinations in multi-image MLLMs.<n>We propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images.<n>MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination.
arXiv Detail & Related papers (2025-08-01T15:49:29Z)
Stop learning it all to mitigate visual hallucination, Focus on the hallucination target [0.10571493942475592]
Multimodal Large Language Models (MLLMs) frequently suffer from hallucination issues.<n> hallucinations undermine model reliability in practical applications.<n>Mymethod is a preference learning approach that mitigates hallucinations by focusing on targeted areas.
arXiv Detail & Related papers (2025-06-13T02:35:03Z)
A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models [30.037505914306504]
Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks.<n>LVLMs generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability.<n>In this paper, we analyze each component of LLaVA-like LVLMs to identify potential sources of error and their impact.
arXiv Detail & Related papers (2025-05-04T01:47:58Z)
Understanding and Evaluating Hallucinations in 3D Visual Language Models [42.355169504378246]
3D-LLMs have been proposed to tackle complex tasks in embodied intelligence and scene understanding. They are significantly affected by hallucinations. This work presents the first systematic study of hallucinations in 3D-LLMs.
arXiv Detail & Related papers (2025-02-18T07:15:43Z)
Evaluating Hallucination in Large Vision-Language Models based on Context-Aware Object Similarities [5.602853217226167]
We present Context-Aware Object Similarities (CAOS), a novel approach for evaluating object hallucination in large vision-language models (LVLMs) CAOS integrates object statistics with semantic relationships between objects in captions and ground-truth data. To address this, we further employ language model-based object recognition to detect potentially out-of-domain hallucinated objects.
arXiv Detail & Related papers (2025-01-25T03:03:18Z)
Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks. These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images. We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z)
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens [7.806633929976787]
Hallucinations in Large Vision-Language Models (LVLMs) significantly undermine their reliability. This paper addresses how LVLMs process visual information and whether this process causes hallucination. We propose a simple inference-time method that adjusts visual attention by integrating information across various heads.
arXiv Detail & Related papers (2024-11-23T03:40:05Z)
Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models [22.42712853647949]
We present an in-depth investigation into the object hallucination problem specifically within the CLIP model. We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities. We show the the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
arXiv Detail & Related papers (2024-10-04T06:24:49Z)
Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks. We present QL-Bench, a benchmark settings to simulate human responses to low-level vision. We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z)
Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models [52.957842999317506]
Object hallucination refers to the phenomenon that the LVLMs claim non-existent objects in the image. We propose a Logical Closed Loop-based framework for Object Hallucination Detection and Mitigation, namely LogicCheckGPT. As a plug-and-play method, it can be seamlessly applied to all existing LVLMs.
arXiv Detail & Related papers (2024-02-18T15:28:39Z)
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models [67.8024390595066]
NOPE (Negative Object Presence Evaluation) is a novel benchmark designed to assess object hallucination in vision-language (VL) models. We extensively investigate the performance of 10 state-of-the-art VL models in discerning the non-existence of objects in visual questions.
arXiv Detail & Related papers (2023-10-09T01:52:27Z)
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages. LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images. We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z)
Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs) We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions. We propose a polling-based query method called POPE to evaluate the object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text. We show that models achieving better scores on standard metrics could hallucinate objects more frequently. Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.