Grounding Visual Illusions in Language: Do Vision-Language Models
Perceive Illusions Like Humans?
- URL: http://arxiv.org/abs/2311.00047v1
- Date: Tue, 31 Oct 2023 18:01:11 GMT
- Title: Grounding Visual Illusions in Language: Do Vision-Language Models
Perceive Illusions Like Humans?
- Authors: Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai
- Abstract summary: Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world.
Do VLMs have similar kinds of illusions as humans do, or do they faithfully learn to represent reality?
We build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs.
- Score: 28.654771227396807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) are trained on vast amounts of data
captured by humans, emulating our understanding of the world. However, human
perception of reality is not always faithful to the physical world, a
phenomenon known as visual illusions. This raises a key question: do VLMs have
similar kinds of illusions as humans do, or do they faithfully learn to
represent reality? To investigate this question, we build a dataset containing
five types of visual illusions and formulate four tasks to examine visual
illusions in state-of-the-art VLMs. Our findings show that although the
overall alignment is low, larger models are closer to human perception and
more susceptible to visual illusions. Our
dataset and initial findings will promote a better understanding of visual
illusions in humans and machines and provide a stepping stone for future
computational models that can better align humans and machines in perceiving
and communicating about the shared visual world. The code and data are
available at https://github.com/vl-illusion/dataset.
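To make the evaluation setup concrete, here is a minimal sketch of what such an illusion probe might look like in code. The record fields and the `vlm_answer` wrapper are hypothetical stand-ins, not the paper's actual interface; the real task formats live in the linked repository.

```python
# Hypothetical sketch: probe a VLM with an illusion stimulus and score whether
# its answer matches human (illusion-consistent) perception or the pixel-level
# ground truth. `vlm_answer` is an assumed wrapper around any image-question VLM.
from dataclasses import dataclass

@dataclass
class IllusionExample:
    image_path: str      # rendered illusion stimulus
    question: str        # e.g. "Which circle is larger, left or right?"
    human_answer: str    # answer a human under the illusion would give
    reality_answer: str  # answer faithful to the physical ground truth

def vlm_answer(image_path: str, question: str) -> str:
    """Placeholder for a real VLM call (any open VQA-capable model)."""
    raise NotImplementedError

def humanlikeness(examples: list) -> float:
    """Fraction of answers agreeing with human (illusory) perception."""
    hits = 0
    for ex in examples:
        pred = vlm_answer(ex.image_path, ex.question).strip().lower()
        hits += pred == ex.human_answer.lower()
    return hits / len(examples)
```

The same loop scored against `reality_answer` instead of `human_answer` gives the complementary reality-alignment number.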
Related papers
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
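A minimal sketch of the kind of objective such alignment implies, assuming two-alternative forced-choice similarity triplets (a reference plus two candidates, with a human vote); the loss shape and margin are illustrative, not the paper's exact training recipe.

```python
# Sketch: pull the human-preferred image closer to the reference in embedding
# space. `backbone` is any image encoder; triplet format is an assumption.
import torch
import torch.nn.functional as F

def perceptual_alignment_loss(backbone, ref, img_a, img_b,
                              human_picked_a, margin=0.05):
    """human_picked_a: bool tensor, True where annotators judged
    img_a more similar to ref than img_b."""
    z_ref = F.normalize(backbone(ref), dim=-1)
    z_a = F.normalize(backbone(img_a), dim=-1)
    z_b = F.normalize(backbone(img_b), dim=-1)
    d_a = 1 - (z_ref * z_a).sum(-1)   # cosine distances
    d_b = 1 - (z_ref * z_b).sum(-1)
    # Signed gap: positive when the human-preferred image is farther away.
    gap = torch.where(human_picked_a, d_a - d_b, d_b - d_a)
    return F.relu(gap + margin).mean()
```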
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z)
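A hedged sketch of how such before/after metrics could be tallied. The exact definitions are in the BEAF paper; this encodes one plausible reading for yes/no questions about an object removed between the two scenes.

```python
# Assumed record fields; the actual BEAF evaluation harness differs.
def beaf_counts(records):
    """records: iterable of dicts with boolean fields
    'correct_before', 'correct_after', 'answer_changed', 'object_removed'.
    Returns raw counts for TU / IG / SB / ID."""
    tu = ig = sb = idc = 0
    for r in records:
        if r["object_removed"]:
            if r["correct_before"] and r["correct_after"]:
                tu += 1      # True Understanding: right both times
            elif not r["correct_before"] and not r["correct_after"]:
                ig += 1      # IGnorance: wrong both times
            elif r["correct_before"] and not r["answer_changed"]:
                sb += 1      # StuBbornness: refuses to update after removal
        elif r["answer_changed"]:
            idc += 1         # InDecision: flips on an unchanged object
    return {"TU": tu, "IG": ig, "SB": sb, "ID": idc}
```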
- Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z)
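A hand-written approximation of a VoT-style prompt for a grid-navigation task; the paper's actual templates differ, and `llm_complete` stands in for any text-completion API.

```python
# Illustrative VoT-style prompt: the model is asked to draw the intermediate
# grid state after every reasoning step, interleaving "visualization" and text.
VOT_PROMPT = """\
You are navigating a 4x4 grid from S to G. After EACH move, visualize the
current grid state as ASCII art (S=start, G=goal, X=wall, @=you), then state
your next move. Finish with the full move sequence.

Grid:
S . . .
. X X .
. . X .
. . . G
"""

def vot_query(llm_complete, prompt=VOT_PROMPT):
    """llm_complete: any text-completion callable, e.g. an API wrapper."""
    return llm_complete(prompt)
```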
- Are Vision Language Models Texture or Shape Biased and Can We Steer Them? [29.837399598519557]
We study the texture vs. shape bias in vision language models (VLMs).
We find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text.
For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone.
arXiv Detail & Related papers (2024-03-14T09:07:14Z)
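The steering numbers are naturally read against the standard cue-conflict shape-bias score (shape decisions over shape-plus-texture decisions, following Geirhos et al.); a sketch follows, where `classify` is an assumed wrapper around the VLM's prompted classification.

```python
# Cue-conflict images carry one class as shape and another as texture;
# shape bias counts how often the model sides with shape.
def shape_bias(examples, classify):
    """examples: iterable of (image, shape_label, texture_label) triples.
    classify: callable returning the model's class choice for an image."""
    shape_hits = texture_hits = 0
    for image, shape_label, texture_label in examples:
        pred = classify(image)
        if pred == shape_label:
            shape_hits += 1
        elif pred == texture_label:
            texture_hits += 1
        # predictions matching neither cue are ignored, per convention
    total = shape_hits + texture_hits
    return shape_hits / total if total else 0.0
```

Steering by prompting then amounts to swapping the instruction inside `classify` (e.g. "identify the shape of the object" vs. "identify the texture") and re-running the same score.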
- BRI3L: A Brightness Illusion Image Dataset for Identification and Localization of Regions of Illusory Perception [4.685953126232505]
We develop a dataset of visual illusions and a benchmark using data-driven approaches for illusion classification and localization.
We consider five types of brightness illusions: 1) Hermann grid, 2) Simultaneous Contrast, 3) White illusion, 4) Grid illusion, and 5) Induced Grating illusion.
The trained deep learning models are also shown to generalize to unseen brightness illusions, such as brightness assimilation to contrast transitions.
arXiv Detail & Related papers (2024-02-07T02:57:40Z)
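As an illustration of the stimulus family, a few lines of numpy suffice to render a simultaneous-contrast image: two physically identical gray patches on dark and light surrounds. This generator is an illustrative stand-in, not the dataset's rendering code.

```python
import numpy as np

def simultaneous_contrast(size=256, patch=64, gray=128):
    """Two identical gray patches; the one on the dark surround appears lighter."""
    img = np.zeros((size, 2 * size), dtype=np.uint8)
    img[:, :size] = 32    # dark surround (left half)
    img[:, size:] = 224   # light surround (right half)
    for cx in (size // 2, size + size // 2):
        y0 = size // 2 - patch // 2
        x0 = cx - patch // 2
        img[y0:y0 + patch, x0:x0 + patch] = gray  # physically identical patches
    return img
```

The patch regions double as a ground-truth localization mask for the illusory-perception area.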
- Improving generalization by mimicking the human visual diet [34.32585612888424]
We present a new perspective on bridging the generalization gap between biological and computer vision.
Our results demonstrate that incorporating variations and contextual cues ubiquitous in the human visual training data (visual diet) significantly improves generalization to real-world transformations.
arXiv Detail & Related papers (2022-06-15T20:32:24Z)
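One way to approximate such a "visual diet" in a standard pipeline is an augmentation stack over lighting, blur, and viewpoint; the transforms below are an illustrative stand-in, not the paper's actual pipeline (which also models contextual cues).

```python
from torchvision import transforms

# Illustrative stack of variations ubiquitous in natural human viewing.
human_diet_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),       # viewpoint/scale
    transforms.ColorJitter(brightness=0.4, contrast=0.4),      # lighting changes
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # optics/defocus
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])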
- Can machines learn to see without visual databases? [93.73109506642112]
This paper focuses on developing machines that learn to see without needing to handle visual databases.
This might open the doors to a truly competitive track concerning deep learning technologies for vision.
arXiv Detail & Related papers (2021-10-12T13:03:54Z)
- S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z)
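A minimal sketch of the underlying idea of a neural implicit field: an MLP maps a 3D query point and a pose code to an occupancy value and per-bone skinning weights. Layer sizes and the pose-code interface are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ImplicitHumanField(nn.Module):
    """f(xyz, pose) -> (occupancy, skinning weights), queried per 3D point."""
    def __init__(self, pose_dim=64, n_bones=24, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + n_bones),  # occupancy + skinning logits
        )

    def forward(self, xyz, pose_code):
        out = self.net(torch.cat([xyz, pose_code], dim=-1))
        occupancy = torch.sigmoid(out[..., :1])
        skinning = torch.softmax(out[..., 1:], dim=-1)  # weights sum to 1
        return occupancy, skinning
```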
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a visual-only state-of-the-art method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)
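One plausible shape for such interaction supervision is a contrastive objective pairing each video frame with its co-recorded interaction signal; the InfoNCE sketch below is illustrative, and the encoder interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def interaction_infonce(frame_emb, signal_emb, temperature=0.07):
    """frame_emb, signal_emb: (B, D) embeddings from separate encoders;
    row i of each comes from the same moment in the same video."""
    z_v = F.normalize(frame_emb, dim=-1)
    z_s = F.normalize(signal_emb, dim=-1)
    logits = z_v @ z_s.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_v.size(0), device=z_v.device)
    return F.cross_entropy(logits, targets)        # matched pairs on diagonal
```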
- Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense [142.53911271465344]
We argue that the next generation of AI must embrace "dark" humanlike common sense for solving novel tasks.
We identify functionality, physics, intent, causality, and utility (FPICU) as the five core domains of cognitive AI with humanlike common sense.
arXiv Detail & Related papers (2020-04-20T04:07:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.