Eye vs. AI: Human Gaze and Model Attention in Video Memorability
- URL: http://arxiv.org/abs/2311.16484v1
- Date: Sun, 26 Nov 2023 05:14:06 GMT
- Title: Eye vs. AI: Human Gaze and Model Attention in Video Memorability
- Authors: Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar
- Abstract summary: We propose a Transformer-based model with spatio-temporal attention that matches SoTA performance on video memorability prediction.
We compare model attention against human gaze fixation density maps collected through a small-scale eye-tracking experiment.
We observe that the model assigns greater importance to the initial frames, mimicking temporal attention patterns found in humans.
- Score: 22.718191366938278
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding the factors that determine video memorability has important
applications in areas such as educational technology and advertising. Towards
this goal, we investigate the semantic and temporal attention mechanisms
underlying video memorability. We propose a Transformer-based model with
spatio-temporal attention that matches SoTA performance on video memorability
prediction on a large naturalistic video dataset. More importantly, the
self-attention patterns show us where the model looks to predict memorability.
We compare model attention against human gaze fixation density maps collected
through a small-scale eye-tracking experiment where humans perform a video
memory task. Quantitative saliency metrics show that the model attention and
human gaze follow similar patterns. Furthermore, while panoptic segmentation
confirms that the model and humans attend more to thing classes, stuff classes
that receive increased/decreased attention tend to have higher memorability
scores. We also observe that the model assigns greater importance to the
initial frames, mimicking temporal attention patterns found in humans.
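As a rough illustration of the comparison described in the abstract, the sketch below scores how closely a model attention map matches a human gaze fixation density map. The abstract does not say which saliency metrics were used; Pearson's linear correlation coefficient (CC) and histogram intersection (SIM) are assumed here because they are common choices in saliency evaluation. The function names and the toy 2x2 maps are hypothetical and are not taken from the paper.

```python
import math


def normalize(map2d):
    """Flatten a 2D map and scale it so that its values sum to 1."""
    flat = [float(v) for row in map2d for v in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("map must have positive total mass")
    return [v / total for v in flat]


def pearson_cc(attention, gaze):
    """Linear correlation coefficient (CC) between two maps of equal size."""
    a, g = normalize(attention), normalize(gaze)
    if len(a) != len(g):
        raise ValueError("maps must have the same number of cells")
    mean_a, mean_g = sum(a) / len(a), sum(g) / len(g)
    cov = sum((x - mean_a) * (y - mean_g) for x, y in zip(a, g))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_g = sum((y - mean_g) ** 2 for y in g)
    denom = math.sqrt(var_a * var_g)
    # A constant map has zero variance; report no correlation in that case.
    return cov / denom if denom > 0 else 0.0


def similarity(attention, gaze):
    """Histogram intersection (SIM): sum of cell-wise minima of the normalized maps."""
    a, g = normalize(attention), normalize(gaze)
    if len(a) != len(g):
        raise ValueError("maps must have the same number of cells")
    return sum(min(x, y) for x, y in zip(a, g))


# Toy maps (assumed values): attention and gaze both peak in the bottom-right cell.
model_attention = [[0.1, 0.2], [0.3, 0.4]]
human_gaze = [[0.0, 0.1], [0.3, 0.6]]
print("CC :", pearson_cc(model_attention, human_gaze))
print("SIM:", similarity(model_attention, human_gaze))
```

Identical maps give a CC and SIM of 1, while maps with no overlapping mass give a SIM of 0. In practice both maps would first be resized to a common resolution.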
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze? [0.0]
Self-attention functions in state-of-the-art NLP models often correlate with human attention.
We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention.
arXiv Detail & Related papers (2022-04-25T08:23:13Z)
- GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
- Beyond Tracking: Using Deep Learning to Discover Novel Interactions in Biological Swarms [3.441021278275805]
We propose training deep network models to predict system-level states directly from generic graphical features from the entire view.
Because the resulting predictive models are not based on human-understood predictors, we use explanatory modules.
This represents an example of augmented intelligence in behavioral ecology -- knowledge co-creation in a human-AI team.
arXiv Detail & Related papers (2021-08-20T22:50:41Z)
- Gaze Perception in Humans and CNN-Based Model [66.89451296340809]
We compare how a CNN (convolutional neural network) based model of gaze and humans infer the locus of attention in images of real-world scenes.
We show that compared to the model, humans' estimates of the locus of attention are more influenced by the context of the scene.
arXiv Detail & Related papers (2021-04-17T04:52:46Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition [14.07119502083967]
Driver activities are different since they are executed by the same subject with similar body-part movements, resulting in subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z)
- GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z)
- Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability [17.00485879591431]
We develop a predictive model of human visual event memory and how those memories decay over time.
We introduce Memento10k, a new, dynamic video memorability dataset containing human annotations at different viewing delays.
arXiv Detail & Related papers (2020-09-05T17:24:02Z)
- Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z)