HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
- URL: http://arxiv.org/abs/2310.14566v5
- Date: Mon, 25 Mar 2024 06:05:24 GMT
- Title: HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
- Authors: Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
- Abstract summary: We introduce HallusionBench, a benchmark for the evaluation of image-context reasoning.
The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts.
In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V.
- Score: 69.52245481329899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.
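The abstract reports a question-pair accuracy without defining it on this page. The following is a minimal sketch of one plausible reading, assuming a pair counts as correct only when every question in it is answered correctly. The record layout `(pair_id, is_correct)` and both function names are hypothetical and are not taken from the HallusionBench codebase.

```python
from collections import defaultdict


def per_question_accuracy(records):
    """Fraction of individual questions answered correctly."""
    records = list(records)
    if not records:
        return 0.0
    return sum(bool(ok) for _, ok in records) / len(records)


def question_pair_accuracy(records):
    """Fraction of pairs in which every question was answered correctly.

    Assumption: `records` is an iterable of (pair_id, is_correct) tuples,
    where questions that belong to the same control group share a pair_id.
    """
    groups = defaultdict(list)
    for pair_id, ok in records:
        groups[pair_id].append(bool(ok))
    if not groups:
        return 0.0
    solved = sum(all(flags) for flags in groups.values())
    return solved / len(groups)


# Toy data (made up for illustration): three pairs, only "p1" fully correct.
records = [
    ("p1", True), ("p1", True),
    ("p2", True), ("p2", False),
    ("p3", False), ("p3", False),
]
pair_acc = question_pair_accuracy(records)
question_acc = per_question_accuracy(records)
```

On the toy data, half of the individual questions are correct but only one of three pairs is fully correct, which shows why a pair-level score can sit well below a per-question score.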
Related papers
- A Unified Hallucination Mitigation Framework for Large Vision-Language Models [18.595958586621943]
We present a unified framework, Dentist, for hallucination mitigation.
The core step is to first classify the queries, then perform different processes of hallucination mitigation based on the classification result.
On MMbench, we achieve a 13.44%/10.2%/15.8% improvement in accuracy on Image Quality.
arXiv Detail & Related papers (2024-09-24T22:36:58Z)
- Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark setting that simulates human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)
- AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models [91.78328878860003]
Large vision-language models (LVLMs) are prone to hallucinations.
Benchmarks often rely on hand-crafted corner cases whose failure patterns may not generalize well.
We develop AutoHallusion, the first automated benchmark generation approach.
arXiv Detail & Related papers (2024-06-16T11:44:43Z)
- Mitigating Hallucination in Visual Language Models with Visual Supervision [33.05550629039951]
Large vision-language models (LVLMs) suffer heavily from hallucination.
The key problem lies in their weak ability to comprehend detailed content in a multi-modal context.
In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs.
arXiv Detail & Related papers (2023-11-27T09:30:02Z)
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [81.20804369985376]
We conduct a large-scale subjective experiment collecting a vast amount of real human feedback on low-level vision.
The constructed **Q-Pathway** dataset includes 58K detailed human feedback entries on 18,973 images.
We design a GPT-participated conversion to process this feedback into 200K instruction-response pairs in diverse formats.
arXiv Detail & Related papers (2023-11-12T09:10:51Z)
- Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges [54.42256219010956]
This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference.
Bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data.
Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted by how the text prompt is phrased or how the input image is presented.
arXiv Detail & Related papers (2023-11-06T17:26:59Z)