Grounding Visual Illusions in Language: Do Vision-Language Models
Perceive Illusions Like Humans?
- URL: http://arxiv.org/abs/2311.00047v1
- Date: Tue, 31 Oct 2023 18:01:11 GMT
- Title: Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?
- Authors: Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai
- Abstract summary: Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world.
Do VLMs have similar kinds of illusions as humans do, or do they faithfully learn to represent reality?
We build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs.
- Score: 28.654771227396807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) are trained on vast amounts of data captured by
humans, emulating our understanding of the world. However, human perception of
reality isn't always faithful to the physical world, a phenomenon known as
visual illusions. This raises a key question: do VLMs have similar kinds of
illusions as humans do, or do they faithfully learn to represent reality? To
investigate this question, we build a dataset containing five types of visual
illusions and formulate four tasks to examine visual illusions in
state-of-the-art VLMs. Our findings show that although the overall alignment is
low, larger models are closer to human perception and more susceptible to
visual illusions. Our
dataset and initial findings will promote a better understanding of visual
illusions in humans and machines and provide a stepping stone for future
computational models that can better align humans and machines in perceiving
and communicating about the shared visual world. The code and data are
available at https://github.com/vl-illusion/dataset.
Related papers
- IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models [56.34742191010987]
Current Visual Language Models (VLMs) show impressive image understanding but struggle with visual illusions.
We introduce IllusionBench, a comprehensive visual illusion dataset that encompasses classic cognitive illusions and real-world scene illusions.
We design trap illusions that resemble classical patterns but differ in reality, highlighting issues in SOTA models.
arXiv Detail & Related papers (2025-01-01T14:10:25Z)
- The Art of Deception: Color Visual Illusions and Diffusion Models [55.830105086695]
Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions.
We show how visual illusions are encoded in diffusion models.
We also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models.
arXiv Detail & Related papers (2024-12-13T13:07:08Z)
- Evaluating Model Perception of Color Illusions in Photorealistic Scenes [16.421832484760987]
We study the perception of color illusions by vision-language models.
We propose an automated framework for generating color illusion images.
Experiments show that all studied VLMs exhibit perceptual biases similar to human vision.
arXiv Detail & Related papers (2024-12-09T03:49:10Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- BRI3L: A Brightness Illusion Image Dataset for Identification and Localization of Regions of Illusory Perception [4.685953126232505]
We develop a dataset of visual illusions and benchmark using data-driven approach for illusion classification and localization.
We consider five types of brightness illusions: 1) Hermann grid, 2) Simultaneous Contrast, 3) White illusion, 4) Grid illusion, and 5) Induced Grating illusion.
The deep learning model is also shown to generalize to unseen brightness illusions, such as brightness assimilation to contrast transitions.
arXiv Detail & Related papers (2024-02-07T02:57:40Z)
- Improving generalization by mimicking the human visual diet [34.32585612888424]
We present a new perspective on bridging the generalization gap between biological and computer vision.
Our results demonstrate that incorporating variations and contextual cues ubiquitous in the human visual training data (visual diet) significantly improves generalization to real-world transformations.
arXiv Detail & Related papers (2022-06-15T20:32:24Z)
- Can machines learn to see without visual databases? [93.73109506642112]
This paper focuses on developing machines that learn to see without needing to handle visual databases.
This might open the doors to a truly competitive track concerning deep learning technologies for vision.
arXiv Detail & Related papers (2021-10-12T13:03:54Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms a visual-only state-of-the-art method MoCo.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
- Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense [142.53911271465344]
We argue that the next generation of AI must embrace "dark" humanlike common sense for solving novel tasks.
We identify functionality, physics, intent, causality, and utility (FPICU) as the five core domains of cognitive AI with humanlike common sense.
arXiv Detail & Related papers (2020-04-20T04:07:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.