VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task
- URL: http://arxiv.org/abs/2511.13420v1
- Date: Mon, 17 Nov 2025 14:32:06 GMT
- Title: VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task
- Authors: Xingming Long, Jie Zhang, Shiguang Shan, Xilin Chen,
- Abstract summary: We introduce Voluntary-imagined Object Presence Evaluation (VOPE) to assess LVLMs' hallucinations in voluntary imagination tasks.<n>VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response.<n> consistency between the model's interpretation and the object's presence in the image is then used to determine whether the model hallucinates when generating the response.
- Score: 73.75049937317506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks that prohibit any output absent from the image. However, little attention has been paid to hallucinations in voluntary imagination tasks, e.g., story writing, where the models are expected to generate novel content beyond the given image. In these tasks, it is inappropriate to simply regard such imagined novel content as hallucinations. To address this limitation, we introduce Voluntary-imagined Object Presence Evaluation (VOPE)-a novel method to assess LVLMs' hallucinations in voluntary imagination tasks via presence evaluation. Specifically, VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response. The consistency between the model's interpretation and the object's presence in the image is then used to determine whether the model hallucinates when generating the response. We apply VOPE to several mainstream LVLMs and hallucination mitigation methods, revealing two key findings: (1) most LVLMs hallucinate heavily during voluntary imagination, and their performance in presence evaluation is notably poor on imagined objects; (2) existing hallucination mitigation methods show limited effect in voluntary imagination tasks, making this an important direction for future research.
Related papers
- Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.<n>These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.<n>We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models [22.996176483599868]
We develop a framework to measure the object and relation hallucination in Large Vision-Language Models (LVLMs) simultaneously.<n>Based on our framework, we introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark.
arXiv Detail & Related papers (2024-10-30T15:25:06Z) - Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models [22.42712853647949]
We present an in-depth investigation into the object hallucination problem specifically within the CLIP model.
We unveil that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between vision and language modalities.
We show the the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.
arXiv Detail & Related papers (2024-10-04T06:24:49Z) - Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? [53.89380284760555]
Large vision-language models (LVLMs) produce captions that mention concepts that cannot be found in the image.
These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption.
Recent work suggests that addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination.
arXiv Detail & Related papers (2024-06-20T16:56:11Z) - Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs [52.497823009176074]
Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations.<n>We introduce Visual Description Grounded Decoding (VDGD), a training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs.
arXiv Detail & Related papers (2024-05-24T16:21:59Z) - A Survey on Hallucination in Large Vision-Language Models [18.540878498840435]
Large Vision-Language Models (LVLMs) have attracted growing attention within the AI landscape for its practical implementation potential.
However, hallucination'', or more specifically, the misalignment between factual visual content and corresponding textual generation, poses a significant challenge of utilizing LVLMs.
We dissect LVLM-related hallucinations in an attempt to establish an overview and facilitate future mitigation.
arXiv Detail & Related papers (2024-02-01T00:33:21Z) - Evaluation and Analysis of Hallucination in Large Vision-Language Models [49.19829480199372]
Large Vision-Language Models (LVLMs) have recently achieved remarkable success.
LVLMs are still plagued by the hallucination problem.
Hallucination refers to the information of LVLMs' responses that does not exist in the visual input.
arXiv Detail & Related papers (2023-08-29T08:51:24Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs)
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z) - Plausible May Not Be Faithful: Probing Object Hallucination in
Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.