PhD: A Prompted Visual Hallucination Evaluation Dataset
- URL: http://arxiv.org/abs/2403.11116v1
- Date: Sun, 17 Mar 2024 06:53:44 GMT
- Title: PhD: A Prompted Visual Hallucination Evaluation Dataset
- Authors: Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Xirong Li
- Abstract summary: We study the Intrinsic Vision-Language Hallucination (IVL-Hallu) issue, thoroughly analyzing different types of IVL-Hallu in terms of their causes and manifestations.
Specifically, we propose several novel IVL-Hallu tasks and categorize them into four types: (a) object hallucination, (b) attribute hallucination, (c) multi-modal conflicting hallucination, and (d) counter-common-sense hallucination.
Based on these, we propose a more challenging benchmark named PhD to evaluate and explore IVL-Hallu.
- Score: 23.49107734068849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid growth of Large Language Models (LLMs) has driven the development of Large Vision-Language Models (LVLMs). The challenge of hallucination, prevalent in LLMs, also emerges in LVLMs. However, most existing efforts mainly focus on object hallucination in LVLMs, ignoring the diverse types of LVLM hallucinations. In this study, we delve into the Intrinsic Vision-Language Hallucination (IVL-Hallu) issue, thoroughly analyzing different types of IVL-Hallu in terms of their causes and manifestations. Specifically, we propose several novel IVL-Hallu tasks and categorize them into four types: (a) object hallucination, which arises from the misidentification of objects; (b) attribute hallucination, which is caused by the misidentification of attributes; (c) multi-modal conflicting hallucination, which derives from contradictions between textual and visual information; and (d) counter-common-sense hallucination, which stems from contradictions between the LVLM's knowledge and the actual images. Based on this taxonomy, we propose a more challenging benchmark named PhD to evaluate and explore IVL-Hallu, together with an automated pipeline for generating the different types of IVL-Hallu data. Extensive experiments on five SOTA LVLMs reveal their inability to effectively tackle the proposed IVL-Hallu tasks, with detailed analyses and insights on the origins and possible solutions of these new challenges, facilitating future research on IVL-Hallu and LVLMs. The benchmark is available at https://github.com/jiazhen-code/IntrinsicHallu.
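The abstract describes a benchmark of automatically generated probes spanning four hallucination categories. As a rough illustration of how per-category accuracy on such a benchmark might be computed, here is a minimal sketch in Python; the yes/no item schema, the `ask_lvlm` wrapper, and the answer normalization are assumptions for illustration, not the paper's released pipeline.

```python
from collections import defaultdict

# The four IVL-Hallu categories named in the abstract.
CATEGORIES = {
    "object",                # (a) object hallucination
    "attribute",             # (b) attribute hallucination
    "multimodal_conflict",   # (c) multi-modal conflicting hallucination
    "counter_common_sense",  # (d) counter-common-sense hallucination
}

def normalize(reply: str) -> str:
    """Map a free-form model reply to 'yes'/'no' when possible."""
    reply = reply.strip().lower()
    if reply.startswith("yes"):
        return "yes"
    if reply.startswith("no"):
        return "no"
    return "unknown"

def evaluate(items, ask_lvlm):
    """items: dicts with 'image', 'question', 'answer' ('yes'/'no'), 'category'.
    ask_lvlm(image_path, question) -> reply string from the model under test."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        assert item["category"] in CATEGORIES, f"unknown category: {item['category']}"
        pred = normalize(ask_lvlm(item["image"], item["question"]))
        total[item["category"]] += 1
        correct[item["category"]] += int(pred == item["answer"])
    # Per-category accuracy; an overall score can be derived by summing the counters.
    return {cat: correct[cat] / total[cat] for cat in total}
```

Reporting accuracy per category, rather than a single pooled score, matches the abstract's point that different hallucination types have different causes and should be analyzed separately.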
Related papers
- A Survey of Multimodal Hallucination Evaluation and Detection [52.03164192840023]
Multi-modal Large Language Models (MLLMs) have emerged as a powerful paradigm for integrating visual and textual information. These models often suffer from hallucination, producing content that appears plausible but contradicts the input content or established world knowledge. This survey offers an in-depth review of hallucination evaluation benchmarks and detection methods across Image-to-Text (I2T) and Text-to-Image (T2I) generation tasks.
arXiv Detail & Related papers (2025-07-25T07:22:42Z)
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models [30.037505914306504]
Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in multimodal tasks. However, LVLMs may generate inaccurate visual object-related information based on the query input, potentially leading to misinformation and concerns about safety and reliability. In this paper, we analyze each component of LLaVA-like LVLMs to identify potential sources of error and their impact.
arXiv Detail & Related papers (2025-05-04T01:47:58Z)
- Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.
These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.
We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
- LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts.
It features GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z)
- HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding [36.360171373963716]
Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks.
However, these models still suffer from multimodal hallucination, i.e., generating objects or content that is inconsistent with the images.
We propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD) to address this issue.
arXiv Detail & Related papers (2024-09-30T15:52:05Z)
- Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs [54.50483041708911]
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs.
Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types.
Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z)
- MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context [21.562034852024272]
Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data.
Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets.
We introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs.
arXiv Detail & Related papers (2024-07-03T00:59:03Z)
- Visual Hallucination: Definition, Quantification, and Prescriptive Remediations [5.980832131162941]
Hallucination presents perhaps the most significant impediment to the advancement of AI.
We offer a fine-grained profiling of hallucination based on two tasks: i) image captioning, and ii) Visual Question Answering (VQA).
We curate a dataset comprising 2,000 samples generated using eight tasks of captioning and VQA along with human annotations for the discourse.
arXiv Detail & Related papers (2024-03-26T01:28:42Z)
- Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination [14.25488878224697]
We propose Pensieve, a training-free method that leverages analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics.
Pensieve mitigates errors from both the visual and textual branches by adaptively scaling the subtracted scores.
arXiv Detail & Related papers (2024-03-21T13:49:42Z)
- Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem [58.3723958800254]
Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks.
However, they are susceptible to producing unreliable conjectures in ambiguous contexts, a phenomenon known as hallucination.
This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWPs).
arXiv Detail & Related papers (2024-03-06T09:06:34Z)
- DelucionQA: Detecting Hallucinations in Domain-specific Question Answering [22.23664008053246]
Hallucination is a well-known phenomenon in text generated by large language models (LLMs).
We introduce a dataset, DelucionQA, that captures hallucinations made by retrieval-augmented LLMs for a domain-specific QA task.
We propose a set of hallucination detection methods to serve as baselines for future work from the research community.
arXiv Detail & Related papers (2023-12-08T17:41:06Z)
- HallE-Control: Controlling Object Hallucination in Large Multimodal Models [80.03697683629035]
We introduce CCEval, a GPT-4-assisted evaluation method for detailed captioning.
While LMMs demonstrate minimal object existence hallucination in existing VQA benchmarks, our proposed evaluation reveals continued susceptibility to such hallucinations.
Our method reduces hallucination by 44% compared to LLaVA-7B and maintains object coverage.
arXiv Detail & Related papers (2023-10-03T04:01:27Z)
- Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate object hallucination (a sketch of this polling setup appears after this list).
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)
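The POPE entry above describes a polling-based query method for evaluating object hallucination. The following minimal sketch illustrates the general idea of such polling; the question template, the uniform negative sampling, and the `ask_lvlm` wrapper are assumptions for illustration rather than the authors' implementation.

```python
# POPE-style polling sketch: templated yes/no object-existence questions over
# objects present in the image (expected "yes") and sampled absent objects
# (expected "no"). Illustrative only, not the released POPE code.
import random

QUESTION_TEMPLATE = "Is there a {obj} in the image?"

def build_poll(present_objects, candidate_pool, num_negatives=3, seed=0):
    """Return (question, expected_answer) pairs for one image.
    present_objects: objects annotated in the image.
    candidate_pool: vocabulary from which absent objects are sampled.
    Negative sampling here is uniform; other variants could instead rank the
    pool, e.g. by object frequency or co-occurrence statistics."""
    rng = random.Random(seed)
    absent = [obj for obj in candidate_pool if obj not in present_objects]
    negatives = rng.sample(absent, min(num_negatives, len(absent)))
    polls = [(QUESTION_TEMPLATE.format(obj=obj), "yes") for obj in present_objects]
    polls += [(QUESTION_TEMPLATE.format(obj=obj), "no") for obj in negatives]
    return polls

def poll_accuracy(image_path, polls, ask_lvlm):
    """ask_lvlm(image_path, question) -> reply string from the model under test."""
    hits = sum(
        ask_lvlm(image_path, question).strip().lower().startswith(expected)
        for question, expected in polls
    )
    return hits / len(polls)
```

Because every poll is a binary existence question with a known ground-truth answer, accuracy (or precision/recall over the "yes" class) can be computed without any LLM-based judging.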
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.