Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
- URL: http://arxiv.org/abs/2311.16484v2
- Date: Tue, 05 Nov 2024 16:25:05 GMT
- Title: Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
- Authors: Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar
- Abstract summary: We adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction.
We compare model attention against human gaze fixations collected through a small-scale eye-tracking study in which humans perform the video memory task.
- Score: 21.44002657362493
- Abstract: Understanding what makes a video memorable has important applications in advertising or education technology. Towards this goal, we investigate spatio-temporal attention mechanisms underlying video memorability. Different from previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study where humans perform the video memory task. We uncover the following insights: (i) Quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits similar spatial attention patterns to human gaze, especially for more memorable videos. (ii) The model assigns greater importance to initial frames in a video, mimicking human attention patterns. (iii) Panoptic segmentation reveals that both the model and humans assign a greater share of attention to "things" (countable objects) and less attention to "stuff" (amorphous regions) as compared to their occurrence probability.
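As a minimal illustration of insight (i), the sketch below compares a model's spatial attention map with a human fixation density map using the Pearson correlation coefficient (CC), a standard saliency-similarity metric; the array shapes, variable names, and the choice of CC are assumptions for illustration, not necessarily the exact setup or metric used in the paper.

```python
import numpy as np

def normalize_map(m: np.ndarray) -> np.ndarray:
    """Scale a map to zero mean and unit variance (no-op variance guard included)."""
    m = m.astype(np.float64)
    std = m.std()
    return (m - m.mean()) / std if std > 0 else m - m.mean()

def correlation_coefficient(attention: np.ndarray, fixations: np.ndarray) -> float:
    """Pearson CC between a model attention map and a human fixation map of the same shape."""
    a = normalize_map(attention).ravel()
    f = normalize_map(fixations).ravel()
    return float(np.mean(a * f))  # mean of products of z-scored maps = Pearson r

# Hypothetical per-frame maps (e.g., Transformer attention over a 14x14 patch grid
# and a blurred human fixation density map resized to the same grid).
model_attn = np.random.rand(14, 14)
human_fix = np.random.rand(14, 14)
print(f"CC = {correlation_coefficient(model_attn, human_fix):.3f}")
```

Higher CC values indicate closer spatial agreement between model attention and human gaze; the same comparison can be averaged over frames or videos.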
Related papers
- Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction [60.964512894143475]
We present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z) - A domain adaptive deep learning solution for scanpath prediction of
paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that influences several cognitive functions in humans.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
arXiv Detail & Related papers (2022-09-22T22:27:08Z) - Do Transformer Models Show Similar Attention Patterns to Task-Specific
Human Gaze? [0.0]
Self-attention functions in state-of-the-art NLP models often correlate with human attention.
We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention.
arXiv Detail & Related papers (2022-04-25T08:23:13Z) - STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a spatiotemporal-aware unit (STAU) for video prediction and beyond.
Our STAU outperforms other methods on all tasks in terms of both accuracy and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z) - Gaze Perception in Humans and CNN-Based Model [66.89451296340809]
We compare how a CNN (convolutional neural network)-based model of gaze and humans infer the locus of attention in images of real-world scenes.
We show that compared to the model, humans' estimates of the locus of attention are more influenced by the context of the scene.
arXiv Detail & Related papers (2021-04-17T04:52:46Z) - Coarse Temporal Attention Network (CTA-Net) for Driver's Activity
Recognition [14.07119502083967]
A driver's activities are difficult to distinguish since they are executed by the same subject with similar body-part movements, resulting in only subtle changes.
Our model, named Coarse Temporal Attention Network (CTA-Net), introduces coarse temporal branches within a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level, action-specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z) - Multimodal Memorability: Modeling Effects of Semantics and Decay on
Video Memorability [17.00485879591431]
We develop a predictive model of human visual event memory and how those memories decay over time.
We introduce Memento10k, a new, dynamic video memorability dataset containing human annotations at different viewing delays.
arXiv Detail & Related papers (2020-09-05T17:24:02Z) - Knowing What, Where and When to Look: Efficient Video Action Modeling
with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
The proposed What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z) - Detecting Attended Visual Targets in Video [25.64146711657225]
We introduce a new annotated dataset, VideoAttentionTarget, containing complex and dynamic patterns of real-world gaze behavior.
Our experiments show that our model can effectively infer dynamic attention in videos.
We obtain the first results for automatically classifying clinically-relevant gaze behavior without wearable cameras or eye trackers.
arXiv Detail & Related papers (2020-03-05T09:29:48Z)