Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions?
- URL: http://arxiv.org/abs/2506.05765v2
- Date: Wed, 11 Jun 2025 02:03:57 GMT
- Title: Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions?
- Authors: Taiga Shinozaki, Tomoki Doi, Amane Watahiki, Satoshi Nishida, Hitomi Yanaka
- Abstract summary: Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes.
Research has begun exploring whether machines, such as large vision language models (LVLMs), exhibit similar susceptibilities to visual illusions.
- Score: 12.157632635072435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes. Inspired by human vision studies, research has begun exploring whether machines, such as large vision language models (LVLMs), exhibit similar susceptibilities to visual illusions. However, studies have often used non-abstract images and have not distinguished between actual and apparent features, leading to ambiguous assessments of machine cognition. To address these limitations, we introduce a visual question answering (VQA) dataset, categorized into genuine and fake illusions, along with corresponding control images. Genuine illusions present discrepancies between actual and apparent features, whereas fake illusions have identical actual and apparent features even though they look illusory due to a similar geometric configuration. We evaluate the performance of LVLMs on genuine- and fake-illusion VQA tasks and investigate whether the models discern actual and apparent features. Our findings indicate that although LVLMs may appear to recognize illusions by correctly answering questions about both feature types, they predict the same answers for both genuine-illusion and fake-illusion VQA questions. This suggests that their responses might be based on prior knowledge of illusions rather than genuine visual understanding. The dataset is available at https://github.com/ynklab/FILM
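The abstract describes the diagnostic only at a high level: ask the same question about a genuine illusion and its fake counterpart, and check whether the model gives identical answers to both, which would point to prior knowledge of illusions rather than perception. Below is a minimal sketch of such a comparison in Python. The field names (question, genuine_image, fake_image, genuine_answer, fake_answer), the file name film_vqa_pairs.json, and the ask_model callable are illustrative assumptions, not the schema or API of the released FILM dataset.

```python
# Hypothetical sketch of the genuine- vs. fake-illusion comparison described above.
# All field names and the input file layout are assumptions, not the FILM schema.
import json
from typing import Callable, Dict, List


def evaluate_illusion_pairs(
    items: List[Dict[str, str]],
    ask_model: Callable[[str, str], str],  # (image_path, question) -> answer text
) -> Dict[str, float]:
    """Score a model on paired genuine/fake illusion questions and report how
    often it gives identical answers for both images of a pair."""
    if not items:
        raise ValueError("empty item list")
    correct_genuine = correct_fake = identical = 0
    for item in items:
        question = item["question"]
        ans_genuine = ask_model(item["genuine_image"], question).strip().lower()
        ans_fake = ask_model(item["fake_image"], question).strip().lower()
        correct_genuine += ans_genuine == item["genuine_answer"].strip().lower()
        correct_fake += ans_fake == item["fake_answer"].strip().lower()
        # Identical answers on both images suggest the model is not reading the image.
        identical += ans_genuine == ans_fake
    n = len(items)
    return {
        "genuine_accuracy": correct_genuine / n,
        "fake_accuracy": correct_fake / n,
        "identical_answer_rate": identical / n,
    }


if __name__ == "__main__":
    with open("film_vqa_pairs.json") as f:  # hypothetical file name and layout
        items = json.load(f)
    # Plug in any LVLM wrapper here; a constant answer illustrates the failure mode.
    print(evaluate_illusion_pairs(items, ask_model=lambda image, q: "yes"))
```

A real evaluation would replace the lambda with a call to the model under test and would likely also separate questions about actual features from questions about apparent features, as the paper's analysis does.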
Related papers
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.
Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- Do you see what I see? An Ambiguous Optical Illusion Dataset exposing limitations of Explainable AI [4.58733012283457]
We introduce a novel dataset of optical illusions featuring intermingled animal pairs designed to evoke perceptual ambiguity.
We identify generalizable visual concepts, particularly gaze direction and eye cues, as subtle yet impactful features that significantly influence model accuracy.
Our findings underscore the importance of concepts in visual learning and provide a foundation for studying bias and alignment between human and machine vision.
arXiv Detail & Related papers (2025-05-27T12:22:59Z)
- IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models [56.34742191010987]
Current Visual Language Models (VLMs) show impressive image understanding but struggle with visual illusions.
We introduce IllusionBench, a comprehensive visual illusion dataset that encompasses classic cognitive illusions and real-world scene illusions.
We design trap illusions that resemble classical patterns but differ in reality, highlighting issues in SOTA models.
arXiv Detail & Related papers (2025-01-01T14:10:25Z)
- Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.
These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.
We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
- The Art of Deception: Color Visual Illusions and Diffusion Models [55.830105086695]
Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions.
We show how visual illusions are encoded in diffusion models.
We also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models.
arXiv Detail & Related papers (2024-12-13T13:07:08Z)
- The Illusion-Illusion: Vision Language Models See Illusions Where There are None [0.0]
I show that many current vision language systems mistakenly see illusory-illusions as illusions.
I suggest that such failures are part of broader failures already discussed in the literature.
arXiv Detail & Related papers (2024-12-07T03:30:51Z)
- Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark designed to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z)
- Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs [52.497823009176074]
Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations.
We introduce Visual Description Grounded Decoding (VDGD), a training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs.
arXiv Detail & Related papers (2024-05-24T16:21:59Z)
- Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? [28.654771227396807]
Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world.
Do VLMs have similar kinds of illusions as humans do, or do they faithfully learn to represent reality?
We build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs.
arXiv Detail & Related papers (2023-10-31T18:01:11Z)