Eye vs. AI: Human Gaze and Model Attention in Video Memorability
- URL: http://arxiv.org/abs/2311.16484v1
- Date: Sun, 26 Nov 2023 05:14:06 GMT
- Title: Eye vs. AI: Human Gaze and Model Attention in Video Memorability
- Authors: Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar
- Abstract summary: We propose a Transformer-based model with spatio-temporal attention that matches SoTA performance on video memorability prediction.
We compare model attention against human gaze fixation density maps collected through a small-scale eye-tracking experiment.
We observe that the model assigns greater importance to the initial frames, mimicking temporal attention patterns found in humans.
- Score: 22.718191366938278
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding the factors that determine video memorability has important
applications in areas such as educational technology and advertising. Towards
this goal, we investigate the semantic and temporal attention mechanisms
underlying video memorability. We propose a Transformer-based model with
spatio-temporal attention that matches SoTA performance on video memorability
prediction on a large naturalistic video dataset. More importantly, the
self-attention patterns show us where the model looks to predict memorability.
We compare model attention against human gaze fixation density maps collected
through a small-scale eye-tracking experiment where humans perform a video
memory task. Quantitative saliency metrics show that the model attention and
human gaze follow similar patterns. Furthermore, while panoptic segmentation
confirms that the model and humans attend more to thing classes, stuff classes
that receive increased/decreased attention tend to have higher memorability
scores. We also observe that the model assigns greater importance to the
initial frames, mimicking temporal attention patterns found in humans.
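The abstract reports that quantitative saliency metrics show model attention and human gaze following similar patterns, but the specific metrics are not listed here. The snippet below is a minimal, illustrative sketch (not the authors' code) of two metrics commonly used for such comparisons, Pearson's correlation coefficient (CC) and KL divergence, computed between a model attention map and a gaze fixation density map; the 14x14 grid and variable names are assumptions for illustration.

```python
import numpy as np

def correlation_coefficient(attn_map: np.ndarray, fixation_map: np.ndarray) -> float:
    """Pearson CC between a model attention map and a gaze fixation density map."""
    a = (attn_map - attn_map.mean()) / (attn_map.std() + 1e-8)
    f = (fixation_map - fixation_map.mean()) / (fixation_map.std() + 1e-8)
    return float((a * f).mean())

def kl_divergence(attn_map: np.ndarray, fixation_map: np.ndarray, eps: float = 1e-8) -> float:
    """KL divergence, treating both maps as spatial probability distributions."""
    p = fixation_map / (fixation_map.sum() + eps)  # human gaze density
    q = attn_map / (attn_map.sum() + eps)          # model attention
    return float((p * np.log((p + eps) / (q + eps))).sum())

# Toy example: both maps resized to the same grid (e.g., one map per video frame).
rng = np.random.default_rng(0)
model_attention = rng.random((14, 14))  # e.g., self-attention over image patches
gaze_density = rng.random((14, 14))     # smoothed fixation density from eye tracking
print(correlation_coefficient(model_attention, gaze_density))
print(kl_divergence(model_attention, gaze_density))
```

In practice, the per-frame attention maps and fixation maps would be brought to a shared resolution and the metrics averaged over frames and videos.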
Related papers
- Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction [60.964512894143475]
We present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that influences several cognitive functions in humans.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze? [0.0]
Self-attention functions in state-of-the-art NLP models often correlate with human attention.
We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention.
arXiv Detail & Related papers (2022-04-25T08:23:13Z)
- STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a spatiotemporal-aware unit (STAU) for video prediction and beyond.
Our STAU can outperform other methods on all tasks in terms of performance and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z)
- Gaze Perception in Humans and CNN-Based Model [66.89451296340809]
We compare how a CNN (convolutional neural network) based model of gaze and humans infer the locus of attention in images of real-world scenes.
We show that compared to the model, humans' estimates of the locus of attention are more influenced by the context of the scene.
arXiv Detail & Related papers (2021-04-17T04:52:46Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
Unlike generic actions, a driver's activities are hard to distinguish because they are executed by the same subject with similar body part movements, resulting in only subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
- GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner (a generic sketch of this decoupled pattern appears after this list).
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z)
- Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability [17.00485879591431]
We develop a predictive model of human visual event memory and how those memories decay over time.
We introduce Memento10k, a new, dynamic video memorability dataset containing human annotations at different viewing delays.
arXiv Detail & Related papers (2020-09-05T17:24:02Z)
- Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
Our What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z)
- Detecting Attended Visual Targets in Video [25.64146711657225]
We introduce a new annotated dataset, VideoAttentionTarget, containing complex and dynamic patterns of real-world gaze behavior.
Our experiments show that our model can effectively infer dynamic attention in videos.
We obtain the first results for automatically classifying clinically-relevant gaze behavior without wearable cameras or eye trackers.
arXiv Detail & Related papers (2020-03-05T09:29:48Z)
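Several entries above describe attention that is factored over space and time (e.g., GTA applies global temporal attention on top of spatial attention, and the main paper uses spatio-temporal attention). The following is a generic, hypothetical PyTorch sketch of that decoupled pattern, attending over patches within each frame and then over frames at each spatial location; it is not the released implementation of any paper listed here, and the module name, shapes, and head count are assumptions.

```python
import torch
import torch.nn as nn

class DecoupledSpatioTemporalAttention(nn.Module):
    """Hypothetical module: spatial self-attention within each frame,
    then temporal self-attention across frames (illustration only)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) token grid from a video backbone
        b, t, p, d = x.shape
        # Spatial attention: attend over the patches of each frame independently.
        xs = x.reshape(b * t, p, d)
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(b, t, p, d)
        # Temporal attention: attend over frames at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt, _ = self.temporal(xt, xt, xt)
        return xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

# Toy usage: 2 clips, 8 frames, 7x7=49 patch tokens, 128-dim features.
video_tokens = torch.randn(2, 8, 49, 128)
module = DecoupledSpatioTemporalAttention(dim=128)
print(module(video_tokens).shape)  # torch.Size([2, 8, 49, 128])
```

Residual connections, normalization, and positional encodings are omitted for brevity.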
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.