Contrastive Language-Image Pretrained Models are Zero-Shot Human
Scanpath Predictors
- URL: http://arxiv.org/abs/2305.12380v2
- Date: Tue, 23 May 2023 11:17:32 GMT
- Title: Contrastive Language-Image Pretrained Models are Zero-Shot Human
Scanpath Predictors
- Authors: Dario Zanca, Andrea Zugarini, Simon Dietz, Thomas R. Altstidl, Mark A.
Turban Ndjeuha, Leo Schwinn, Bjoern Eskofier
- Abstract summary: CapMIT1003 is a database of captions and click-contingent image explorations collected during captioning tasks.
NevaClip is a novel zero-shot method for predicting visual scanpaths.
- Score: 2.524526956420465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the mechanisms underlying human attention is a fundamental
challenge for both vision science and artificial intelligence. While numerous
computational models of free-viewing have been proposed, less is known about
the mechanisms underlying task-driven image exploration. To address this gap,
we present CapMIT1003, a database of captions and click-contingent image
explorations collected during captioning tasks. CapMIT1003 is based on the same
stimuli as the well-known MIT1003 benchmark, for which eye-tracking data under
free-viewing conditions are available, offering a promising opportunity to study
human attention under both tasks concurrently. We make
this dataset publicly available to facilitate future research in this field. In
addition, we introduce NevaClip, a novel zero-shot method for predicting visual
scanpaths that combines contrastive language-image pretrained (CLIP) models
with biologically-inspired neural visual attention (NeVA) algorithms. NevaClip
simulates human scanpaths by aligning the representation of the foveated visual
stimulus and the representation of the associated caption, employing
gradient-driven visual exploration to generate scanpaths. Our experimental
results demonstrate that NevaClip outperforms existing unsupervised
computational models of human visual attention in terms of scanpath
plausibility, for both captioning and free-viewing tasks. Furthermore, we show
that conditioning NevaClip with incorrect or misleading captions leads to
random behavior, highlighting the significant impact of caption guidance in the
decision-making process. These findings contribute to a better understanding of
mechanisms that guide human attention and pave the way for more sophisticated
computational approaches to scanpath prediction that can integrate direct
top-down guidance of downstream tasks.
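
The abstract's description of NevaClip - aligning a CLIP embedding of the foveated stimulus with the embedding of the associated caption and following the gradient to choose fixations - can be illustrated with a short sketch. The code below is not the authors' implementation: the Gaussian-blending foveation model, the greedy per-fixation optimization, the crude inhibition-of-return term, and all hyperparameters are simplifying assumptions, and function names such as `foveate` and `simulate_scanpath` are made up for illustration; only the openai/CLIP package calls are real.

```python
# Illustrative sketch (not the authors' code) of caption-conditioned,
# gradient-driven scanpath simulation in the spirit of NevaClip.
# Assumptions: openai/CLIP, a simple Gaussian-blending foveation model,
# greedy per-fixation optimisation, and a toy inhibition-of-return penalty.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git
from torchvision.transforms.functional import gaussian_blur

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float().eval()
for p in model.parameters():          # we only optimise fixation coordinates
    p.requires_grad_(False)

def foveate(image, fx, fy, sigma=0.15):
    """Blend the sharp image with a blurred copy using a Gaussian mask centred
    at the normalised fixation (fx, fy); differentiable w.r.t. fx and fy."""
    _, _, h, w = image.shape
    ys = torch.linspace(0, 1, h, device=image.device).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w, device=image.device).view(1, 1, 1, w)
    mask = torch.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
    blurred = gaussian_blur(image, kernel_size=[21, 21], sigma=[8.0, 8.0])
    return mask * image + (1 - mask) * blurred

@torch.no_grad()
def encode_caption(caption):
    tokens = clip.tokenize([caption]).to(device)
    return F.normalize(model.encode_text(tokens), dim=-1)

def simulate_scanpath(image, caption, n_fixations=5, steps=20, lr=0.05):
    """Greedily pick fixations by ascending the CLIP similarity between the
    foveated stimulus and the caption, penalising previously visited locations."""
    text_emb = encode_caption(caption)
    fx = torch.tensor(0.5, device=device, requires_grad=True)
    fy = torch.tensor(0.5, device=device, requires_grad=True)
    scanpath = []
    for _ in range(n_fixations):
        opt = torch.optim.Adam([fx, fy], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            img_emb = F.normalize(model.encode_image(foveate(image, fx, fy)), dim=-1)
            ior = sum(torch.exp(-((fx - px) ** 2 + (fy - py) ** 2) / 0.02)
                      for px, py in scanpath)            # crude inhibition of return
            loss = -(img_emb * text_emb).sum() + 0.5 * ior
            loss.backward()
            opt.step()
            fx.data.clamp_(0, 1); fy.data.clamp_(0, 1)
        scanpath.append((fx.item(), fy.item()))
    return scanpath

# Usage: `image` is a CLIP-preprocessed tensor of shape (1, 3, 224, 224), e.g.
#   image = preprocess(Image.open("stimulus.jpg")).unsqueeze(0).to(device)
#   path = simulate_scanpath(image, "a man riding a horse on the beach")
```

A faithful implementation would use the biologically-inspired foveation and memory mechanisms of NeVA and the evaluation protocol described in the paper; this sketch only illustrates the caption-conditioned, gradient-driven alignment idea.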
Related papers
- Caption-Driven Explorations: Aligning Image and Text Embeddings through Human-Inspired Foveated Vision [3.3295510777293837]
We introduce CapMIT1003, a dataset with captions and click-contingent image explorations, to study human attention during the captioning task.
We also present NevaClip, a zero-shot method for predicting visual scanpaths by combining CLIP models with NeVA algorithms.
arXiv Detail & Related papers (2024-08-19T12:41:46Z)
- GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths [20.384132849805003]
We introduce GazeXplain, a novel study of visual scanpath prediction and explanation.
This involves annotating natural-language explanations for fixations across eye-tracking datasets.
Experiments on diverse eye-tracking datasets demonstrate the effectiveness of GazeXplain in both scanpath prediction and explanation.
arXiv Detail & Related papers (2024-08-05T19:11:46Z)
- Automatic Discovery of Visual Circuits [66.99553804855931]
We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept.
We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
arXiv Detail & Related papers (2024-04-22T17:00:57Z)
- Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex [12.1427193917406]
We propose an artificial neural network dubbed VISION to mimic the human brain and show how it can foster neuroscientific inquiries.
VISION successfully predicts human hemodynamic responses as fMRI voxel values to visual inputs with an accuracy exceeding state-of-the-art performance by 45%.
arXiv Detail & Related papers (2023-09-26T15:38:26Z)
- Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks [64.67735676127208]
Text-to-image diffusion models have shown great potential for benefiting image recognition.
Although promising, unsupervised learning on diffusion-generated images has received little dedicated exploration.
We introduce customized solutions by fully exploiting the aforementioned free attention masks.
arXiv Detail & Related papers (2023-08-13T10:07:46Z)
- Simulating Human Gaze with Neural Visual Attention [44.65733084492857]
We propose the Neural Visual Attention (NeVA) algorithm to integrate guidance of any downstream visual task into attention modeling.
We observe that biologically constrained neural networks generate human-like scanpaths without being trained for this objective.
arXiv Detail & Related papers (2022-11-22T09:02:09Z)
- An Inter-observer consistent deep adversarial training for visual scanpath prediction [66.46953851227454]
We propose an inter-observer consistent adversarial training approach for scanpath prediction through a lightweight deep neural network.
We show the competitiveness of our approach with respect to state-of-the-art methods.
arXiv Detail & Related papers (2022-11-14T13:22:29Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- Behind the Machine's Gaze: Biologically Constrained Neural Networks Exhibit Human-like Visual Attention [40.878963450471026]
We propose the Neural Visual Attention (NeVA) algorithm to generate visual scanpaths in a top-down manner.
We show that the proposed method outperforms state-of-the-art unsupervised human attention models in terms of similarity to human scanpaths.
arXiv Detail & Related papers (2022-04-19T18:57:47Z)
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
- Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called Proactive Pseudo-Intervention (PPI).
PPI leverages proactive interventions to guard against image features with no causal relevance.
We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
arXiv Detail & Related papers (2020-12-06T20:30:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.