Do humans and Convolutional Neural Networks attend to similar areas
during scene classification: Effects of task and image type
- URL: http://arxiv.org/abs/2307.13345v2
- Date: Sun, 15 Oct 2023 13:35:56 GMT
- Title: Do humans and Convolutional Neural Networks attend to similar areas
during scene classification: Effects of task and image type
- Authors: Romy Müller, Marcel Dürschmidt, Julian Ullrich, Carsten Knoll,
Sascha Weber, Steffen Seitz
- Abstract summary: We investigated how the tasks used to elicit human attention maps interact with image characteristics in modulating the similarity between humans and CNN.
We varied the type of image to be categorized, using either singular, salient objects, indoor scenes consisting of object arrangements, or landscapes without distinct objects defining the category.
The influence of human tasks strongly depended on image type: For objects, human manual selection produced maps that were most similar to CNN, while the specific eye movement task had little impact.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning models like Convolutional Neural Networks (CNN) are powerful
image classifiers, but what factors determine whether they attend to similar
image areas as humans do? While previous studies have focused on technological
factors, little is known about the role of factors that affect human attention.
In the present study, we investigated how the tasks used to elicit human
attention maps interact with image characteristics in modulating the similarity
between humans and CNN. We varied the intentionality of human tasks, ranging
from spontaneous gaze during categorization, through intentional gaze-pointing,
to manual area selection. Moreover, we varied the type of image to be
categorized, using either singular, salient objects, indoor scenes consisting
of object arrangements, or landscapes without distinct objects defining the
category. The human attention maps generated in this way were compared to the
CNN attention maps revealed by explainable artificial intelligence (Grad-CAM).
The influence of human tasks strongly depended on image type: For objects,
human manual selection produced maps that were most similar to CNN, while the
specific eye movement task had little impact. For indoor scenes, spontaneous
gaze produced the least similarity, while for landscapes, similarity was
equally low across all human tasks. To better understand these results, we also
compared the different human attention maps to each other. Our results
highlight the importance of taking human factors into account when comparing
the attention of humans and CNN.
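The abstract does not state which similarity measure was used to compare human and CNN attention maps. As an illustrative sketch only, one common choice is the Pearson correlation between two maps treated as spatial distributions; the function names and the toy 4x4 maps below are hypothetical, not taken from the paper:

```python
import numpy as np

def normalize_map(m):
    """Shift an attention map to be non-negative and scale it to sum to 1."""
    m = np.asarray(m, dtype=float)
    m = m - m.min()
    s = m.sum()
    return m / s if s > 0 else np.full(m.shape, 1.0 / m.size)

def map_similarity(human_map, cnn_map):
    """Pearson correlation between two attention maps of equal shape."""
    a = normalize_map(human_map).ravel()
    b = normalize_map(cnn_map).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Toy example: two 4x4 maps with overlapping hot spots.
human = np.zeros((4, 4)); human[1, 1] = 1.0
cnn = np.zeros((4, 4)); cnn[1, 1] = 0.8; cnn[2, 2] = 0.2
print(map_similarity(human, cnn))  # high positive correlation
```

For real data, both maps would first be brought to a common resolution (e.g., the Grad-CAM output upsampled to the image size) before the correlation is computed.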
Related papers
- Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, which impacts several cognitive functions for humans.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Neural Novel Actor: Learning a Generalized Animatable Neural Representation for Human Actors [98.24047528960406]
We propose a new method for learning a generalized animatable neural representation from a sparse set of multi-view imagery of multiple persons.
The learned representation can be used to synthesize novel view images of an arbitrary person from a sparse set of cameras, and further animate them with the user's pose control.
arXiv Detail & Related papers (2022-08-25T07:36:46Z)
- Guiding Visual Attention in Deep Convolutional Neural Networks Based on Human Eye Movements [0.0]
Deep Convolutional Neural Networks (DCNNs) were originally inspired by principles of biological vision.
Recent advances in deep learning seem to decrease this similarity.
We investigate a purely data-driven approach to obtain useful models.
arXiv Detail & Related papers (2022-06-21T17:59:23Z)
- Passive attention in artificial neural networks predicts human visual selectivity [8.50463394182796]
We show that passive attention techniques reveal a significant overlap with human visual selectivity estimates.
We validate these correlational results with causal manipulations using recognition experiments.
This work contributes a new approach to evaluating the biological and psychological validity of leading ANNs as models of human vision.
arXiv Detail & Related papers (2021-07-14T21:21:48Z)
- Gaze Perception in Humans and CNN-Based Model [66.89451296340809]
We compare how a CNN (convolutional neural network) based model of gaze and humans infer the locus of attention in images of real-world scenes.
We show that compared to the model, humans' estimates of the locus of attention are more influenced by the context of the scene.
arXiv Detail & Related papers (2021-04-17T04:52:46Z)
- HumanGPS: Geodesic PreServing Feature for Dense Human Correspondences [60.89437526374286]
Prior art either assumes small motion between frames or relies on local descriptors, which cannot handle large motion or visually ambiguous body parts.
We propose a deep learning framework that maps each pixel to a feature space, where the feature distances reflect the geodesic distances among pixels.
Without any semantic annotation, the proposed embeddings automatically learn to differentiate visually similar parts and align different subjects into a unified feature space.
arXiv Detail & Related papers (2021-03-29T12:43:44Z)
- Fooling the primate brain with minimal, targeted image manipulation [67.78919304747498]
We propose an array of methods for creating minimal, targeted image perturbations that lead to changes in both neuronal activity and perception as reflected in behavior.
Our work shares the same goal with adversarial attack, namely the manipulation of images with minimal, targeted noise that leads ANN models to misclassify the images.
arXiv Detail & Related papers (2020-11-11T08:30:54Z)
- Seeing eye-to-eye? A comparison of object recognition performance in humans and deep convolutional neural networks under image manipulation [0.0]
This study aims towards a behavioral comparison of visual core object recognition performance between humans and feedforward neural networks.
Analyses of accuracy revealed that humans not only outperform DCNNs on all conditions, but also display significantly greater robustness towards shape and most notably color alterations.
arXiv Detail & Related papers (2020-07-13T10:26:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.