Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
- URL: http://arxiv.org/abs/2311.16484v2
- Date: Tue, 05 Nov 2024 16:25:05 GMT
- Title: Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
- Authors: Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar
- Abstract summary: We adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction.
We compare model attention against human gaze fixations collected through a small-scale eye-tracking study in which humans perform the video memory task.
- Score: 21.44002657362493
- Abstract: Understanding what makes a video memorable has important applications in advertising or education technology. Towards this goal, we investigate spatio-temporal attention mechanisms underlying video memorability. Different from previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study where humans perform the video memory task. We uncover the following insights: (i) Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits similar spatial attention patterns to human gaze, especially for more memorable videos. (ii) The model assigns greater importance to initial frames in a video, mimicking human attention patterns. (iii) Panoptic segmentation reveals that both the model and humans assign a greater share of attention to "things" (countable objects) and less attention to "stuff" (amorphous regions) as compared to their occurrence probability.
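As a minimal illustration of insight (i), the sketch below compares a model's spatial attention map with a human fixation density map using the Pearson correlation coefficient (CC), a standard saliency-similarity metric; the array shapes, variable names, and the choice of CC are assumptions for illustration, not necessarily the exact setup or metric used in the paper.

```python
import numpy as np

def normalize_map(m: np.ndarray) -> np.ndarray:
    """Scale a map to zero mean and unit variance (no-op variance guard included)."""
    m = m.astype(np.float64)
    std = m.std()
    return (m - m.mean()) / std if std > 0 else m - m.mean()

def correlation_coefficient(attention: np.ndarray, fixations: np.ndarray) -> float:
    """Pearson CC between a model attention map and a human fixation map of the same shape."""
    a = normalize_map(attention).ravel()
    f = normalize_map(fixations).ravel()
    return float(np.mean(a * f))  # mean of products of z-scored maps = Pearson r

# Hypothetical per-frame maps (e.g., Transformer attention over a 14x14 patch grid
# and a blurred human fixation density map resized to the same grid).
model_attn = np.random.rand(14, 14)
human_fix = np.random.rand(14, 14)
print(f"CC = {correlation_coefficient(model_attn, human_fix):.3f}")
```

Higher CC values indicate closer spatial agreement between model attention and human gaze; the same comparison can be averaged over frames or videos.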
Related papers
- Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction [60.964512894143475]
We present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z) - A domain adaptive deep learning solution for scanpath prediction of
paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that influences several cognitive functions in humans.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z) - Do Transformer Models Show Similar Attention Patterns to Task-Specific
Human Gaze? [0.0]
Self-attention functions in state-of-the-art NLP models often correlate with human attention.
We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention.
arXiv Detail & Related papers (2022-04-25T08:23:13Z) - STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a spatiotemporal-aware unit (STAU) for video prediction and beyond.
Our STAU outperforms other methods on all tasks in terms of both accuracy and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z) - Gaze Perception in Humans and CNN-Based Model [66.89451296340809]
We compare how a CNN (convolutional neural network)-based model of gaze and humans infer the locus of attention in images of real-world scenes.
We show that compared to the model, humans' estimates of the locus of attention are more influenced by the context of the scene.
arXiv Detail & Related papers (2021-04-17T04:52:46Z) - Coarse Temporal Attention Network (CTA-Net) for Driver's Activity
Recognition [14.07119502083967]
A driver's activities are difficult to distinguish since they are executed by the same subject with similar body-part movements, resulting in only subtle changes.
Our model, named Coarse Temporal Attention Network (CTA-Net), introduces coarse temporal branches within a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level, action-specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z) - Multimodal Memorability: Modeling Effects of Semantics and Decay on
Video Memorability [17.00485879591431]
We develop a predictive model of human visual event memory and how those memories decay over time.
We introduce Memento10k, a new, dynamic video memorability dataset containing human annotations at different viewing delays.
arXiv Detail & Related papers (2020-09-05T17:24:02Z) - Knowing What, Where and When to Look: Efficient Video Action Modeling
with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
The proposed What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z) - Detecting Attended Visual Targets in Video [25.64146711657225]
We introduce a new annotated dataset, VideoAttentionTarget, containing complex and dynamic patterns of real-world gaze behavior.
Our experiments show that our model can effectively infer dynamic attention in videos.
We obtain the first results for automatically classifying clinically-relevant gaze behavior without wearable cameras or eye trackers.
arXiv Detail & Related papers (2020-03-05T09:29:48Z)