Detecting Attended Visual Targets in Video
- URL: http://arxiv.org/abs/2003.02501v2
- Date: Mon, 30 Mar 2020 23:38:48 GMT
- Title: Detecting Attended Visual Targets in Video
- Authors: Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M. Rehg
- Abstract summary: We introduce a new annotated dataset, VideoAttentionTarget, containing complex and dynamic patterns of real-world gaze behavior.
Our experiments show that our model can effectively infer dynamic attention in videos.
We obtain the first results for automatically classifying clinically-relevant gaze behavior without wearable cameras or eye trackers.
- Score: 25.64146711657225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of detecting attention targets in video. Our goal is
to identify where each person in each frame of a video is looking, and
correctly handle the case where the gaze target is out-of-frame. Our novel
architecture models the dynamic interaction between the scene and head features
and infers time-varying attention targets. We introduce a new annotated
dataset, VideoAttentionTarget, containing complex and dynamic patterns of
real-world gaze behavior. Our experiments show that our model can effectively
infer dynamic attention in videos. In addition, we apply our predicted
attention maps to two social gaze behavior recognition tasks, and show that the
resulting classifiers significantly outperform existing methods. We achieve
state-of-the-art performance on three datasets: GazeFollow (static images),
VideoAttentionTarget (videos), and VideoCoAtt (videos), and obtain the first
results for automatically classifying clinically-relevant gaze behavior without
wearable cameras or eye trackers.
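The abstract describes a two-branch design in which scene and head features interact and a temporal module infers time-varying attention targets, including an out-of-frame case. Below is a minimal, hedged PyTorch sketch of that idea; the layer sizes, the gating-style interaction, the GRU temporal module, and the 64x64 heatmap resolution are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): a two-branch spatiotemporal gaze-target
# model loosely following the abstract: a head branch and a scene branch interact,
# a recurrent module tracks time-varying attention, and the outputs are a per-frame
# gaze heatmap plus an in-frame/out-of-frame score. All sizes are assumptions.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Small strided conv stage (illustrative, not the paper's backbone)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class GazeTargetSketch(nn.Module):
    def __init__(self, feat_ch=128):
        super().__init__()
        # Head branch: encodes a cropped head image into a gaze-direction feature.
        self.head_enc = nn.Sequential(
            conv_block(3, 32), conv_block(32, 64), conv_block(64, feat_ch),
            nn.AdaptiveAvgPool2d(1),
        )
        # Scene branch: encodes the full frame plus a head-position mask (4 channels).
        self.scene_enc = nn.Sequential(
            conv_block(4, 32), conv_block(32, 64), conv_block(64, feat_ch),
        )
        # The head feature gates the scene feature map (a simple stand-in for the
        # "dynamic interaction between scene and head features" in the abstract).
        self.attn = nn.Sequential(nn.Linear(feat_ch, feat_ch), nn.Sigmoid())
        # Recurrent module over time to infer time-varying attention targets.
        self.temporal = nn.GRU(input_size=feat_ch, hidden_size=feat_ch, batch_first=True)
        # Output heads: gaze heatmap and in-frame vs. out-of-frame classification.
        self.heatmap_head = nn.Sequential(
            nn.Conv2d(feat_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
            nn.Upsample(size=(64, 64), mode="bilinear", align_corners=False),
        )
        self.inframe_head = nn.Linear(feat_ch, 1)

    def forward(self, frames, head_crops, head_masks):
        # frames:     (B, T, 3, H, W) full frames
        # head_crops: (B, T, 3, h, w) cropped head images for one person
        # head_masks: (B, T, 1, H, W) binary mask marking that person's head location
        B, T = frames.shape[:2]
        heatmaps, inframe_logits = [], []
        hidden = None
        for t in range(T):
            head_feat = self.head_enc(head_crops[:, t]).flatten(1)         # (B, C)
            scene_in = torch.cat([frames[:, t], head_masks[:, t]], dim=1)  # (B, 4, H, W)
            scene_feat = self.scene_enc(scene_in)                          # (B, C, h', w')
            gate = self.attn(head_feat)[:, :, None, None]
            fused = scene_feat * gate                       # head-conditioned scene features
            pooled = fused.mean(dim=(2, 3)).unsqueeze(1)    # (B, 1, C)
            out, hidden = self.temporal(pooled, hidden)     # carry state across frames
            heatmaps.append(self.heatmap_head(fused))
            inframe_logits.append(self.inframe_head(out[:, 0]))
        return torch.stack(heatmaps, dim=1), torch.stack(inframe_logits, dim=1)


if __name__ == "__main__":
    model = GazeTargetSketch()
    frames = torch.randn(2, 4, 3, 224, 224)
    heads = torch.randn(2, 4, 3, 64, 64)
    masks = torch.zeros(2, 4, 1, 224, 224)
    hm, io = model(frames, heads, masks)
    print(hm.shape, io.shape)  # (2, 4, 1, 64, 64), (2, 4, 1)
```

In this sketch, the heatmap is supervised against the annotated gaze target and the in-frame logit against the out-of-frame label; the predicted heatmaps could then feed downstream gaze-behavior classifiers, as the abstract describes for the social gaze tasks.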
Related papers
- Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability [21.44002657362493]
We adopt a simple CNN+Transformer architecture with fused temporal attention that enables analysis of features while matching state-of-the-art (SoTA) performance on video memorability prediction.
We compare model attention against human fixations through a small-scale eye-tracking study in which humans perform a video memory task.
arXiv Detail & Related papers (2023-11-26T05:14:06Z) - Decoding Attention from Gaze: A Benchmark Dataset and End-to-End Models [6.642042615005632]
Eye-tracking has potential to provide rich behavioral data about human cognition in ecologically valid environments.
This paper studies using computer vision tools for "attention decoding", the task of assessing the locus of a participant's overt visual attention over time.
arXiv Detail & Related papers (2022-11-20T12:24:57Z) - ViA: View-invariant Skeleton Action Representation Learning via Motion
Retargeting [10.811088895926776]
ViA is a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning.
We conduct a study focusing on transfer-learning for skeleton-based action recognition with self-supervised pre-training on real-world data.
Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy.
arXiv Detail & Related papers (2022-08-31T18:49:38Z) - Weakly Supervised Human-Object Interaction Detection in Video via
Contrastive Spatiotemporal Regions [81.88294320397826]
A system does not know what human-object interactions are present in a video or the actual location of the human and object.
We introduce a dataset comprising over 6.5k videos with human-object interactions that have been curated from sentence captions.
We demonstrate improved performance over weakly supervised baselines adapted to our annotations on our video dataset.
arXiv Detail & Related papers (2021-10-07T15:30:18Z) - HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z) - ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos [79.05486554647918]
We propose PV-SOD, a new task that aims to segment salient objects from panoramic videos.
In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD).
We collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy.
arXiv Detail & Related papers (2021-07-24T15:14:20Z) - Improved Actor Relation Graph based Group Activity Recognition [0.0]
The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc.
This study proposes a video understanding method that focuses mainly on group activity recognition by learning pair-wise actor appearance similarity and actor positions.
arXiv Detail & Related papers (2020-10-24T19:46:49Z) - Hybrid Dynamic-static Context-aware Attention Network for Action
Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z) - Knowing What, Where and When to Look: Efficient Video Action Modeling
with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
The What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.