IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models
- URL: http://arxiv.org/abs/2501.00848v1
- Date: Wed, 01 Jan 2025 14:10:25 GMT
- Title: IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models
- Authors: Yiming Zhang, Zicheng Zhang, Xinyi Wei, Xiaohong Liu, Guangtao Zhai, Xiongkuo Min
- Abstract summary: Current Visual Language Models (VLMs) show impressive image understanding but struggle with visual illusions.
We introduce IllusionBench, a comprehensive visual illusion dataset that encompasses classic cognitive illusions and real-world scene illusions.
We design trap illusions that resemble classical patterns but differ in reality, highlighting hallucination issues in SOTA models.
- Score: 56.34742191010987
- Abstract: Current Visual Language Models (VLMs) show impressive image understanding but struggle with visual illusions, especially in real-world scenarios. Existing benchmarks focus on classical cognitive illusions, which have been learned by state-of-the-art (SOTA) VLMs, revealing issues such as hallucinations and limited perceptual abilities. To address this gap, we introduce IllusionBench, a comprehensive visual illusion dataset that encompasses not only classic cognitive illusions but also real-world scene illusions. This dataset features 1,051 images, 5,548 question-answer pairs, and 1,051 golden text descriptions that address the presence, causes, and content of the illusions. We evaluate ten SOTA VLMs on this dataset using true-or-false, multiple-choice, and open-ended tasks. In addition to real-world illusions, we design trap illusions that resemble classical patterns but differ in reality, highlighting hallucination issues in SOTA models. The top-performing model, GPT-4o, achieves 80.59% accuracy on true-or-false tasks and 76.75% on multiple-choice questions, but still lags behind human performance. In the semantic description task, GPT-4o's hallucinations on classical illusions result in low scores for trap illusions, even falling behind some open-source models. IllusionBench is, to the best of our knowledge, the largest and most comprehensive benchmark for visual illusions in VLMs to date.
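The abstract does not specify the dataset's release format or scoring protocol, but the reported metrics (accuracy on true-or-false and multiple-choice tasks) imply a simple per-task scoring loop. Below is a minimal sketch of such scoring, assuming a hypothetical JSON layout with `question_type`, `answer`, and `prediction` fields; none of these names come from the paper, and open-ended descriptions would need a separate judging step.

```python
# Minimal scoring sketch for IllusionBench-style QA predictions.
# The record layout (question_type / answer / prediction) is assumed,
# not taken from the paper; adapt it to the released dataset format.
import json
from collections import defaultdict

def score_predictions(path: str) -> dict:
    """Compute per-task accuracy over true-or-false and multiple-choice items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path) as f:
        records = json.load(f)
    for rec in records:
        task = rec["question_type"]  # e.g. "true_or_false" or "multiple_choice"
        if task not in ("true_or_false", "multiple_choice"):
            continue  # open-ended answers require a separate semantic judge
        total[task] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total if total[task]}

if __name__ == "__main__":
    # Hypothetical predictions file; path and schema are illustrative only.
    print(score_predictions("illusionbench_predictions.json"))
```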
Related papers
- The Art of Deception: Color Visual Illusions and Diffusion Models [55.830105086695]
Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions.
We show how visual illusions are encoded in diffusion models.
We also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models.
arXiv Detail & Related papers (2024-12-13T13:07:08Z)
- The Illusion-Illusion: Vision Language Models See Illusions Where There are None [0.0]
I show that many current vision language systems mistakenly perceive illusion-illusions (images that resemble classic illusions but contain no actual illusion) as illusions.
I suggest that such failures are part of broader failures already discussed in the literature.
arXiv Detail & Related papers (2024-12-07T03:30:51Z)
- Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? [53.89380284760555]
Large vision-language models (LVLMs) produce captions that mention concepts that cannot be found in the image.
These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption.
Recent work suggests that adding grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces LVLM hallucination.
arXiv Detail & Related papers (2024-06-20T16:56:11Z)
- BRI3L: A Brightness Illusion Image Dataset for Identification and Localization of Regions of Illusory Perception [4.685953126232505]
We develop a dataset of visual illusions and benchmark data-driven approaches for illusion classification and localization.
We consider five types of brightness illusions: 1) Hermann grid, 2) Simultaneous Contrast, 3) White illusion, 4) Grid illusion, and 5) Induced Grating illusion.
The deep learning models are also shown to generalize to unseen brightness illusions, such as brightness assimilation to contrast transitions.
arXiv Detail & Related papers (2024-02-07T02:57:40Z)
- Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? [28.654771227396807]
Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans emulating our understanding of the world.
Do VLMs have similar kinds of illusions as humans do, or do they faithfully learn to represent reality?
We build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs.
arXiv Detail & Related papers (2023-10-31T18:01:11Z)
- HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models [69.52245481329899]
We introduce HallusionBench, a benchmark for the evaluation of image-context reasoning.
The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts.
In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V.
arXiv Detail & Related papers (2023-10-23T04:49:09Z)
- Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
However, LVLMs still suffer from object hallucination, i.e., generating descriptions that include objects not actually present in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z)
- Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text.
We show that models achieving better scores on standard metrics could hallucinate objects more frequently.
Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)