In the Eye of the Beholder: Gaze and Actions in First Person Video
- URL: http://arxiv.org/abs/2006.00626v2
- Date: Sat, 31 Oct 2020 05:00:32 GMT
- Title: In the Eye of the Beholder: Gaze and Actions in First Person Video
- Authors: Yin Li, Miao Liu, James M. Rehg
- Abstract summary: We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a headworn camera.
Our dataset comes with videos, gaze tracking data, hand masks and action annotations.
We propose a novel deep model for joint gaze estimation and action recognition in First Person Vision.
- Score: 30.54510882243602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the task of jointly determining what a person is doing and where
they are looking based on the analysis of video captured by a headworn camera.
To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our
dataset comes with videos, gaze tracking data, hand masks and action
annotations, thereby providing the most comprehensive benchmark for First
Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model
for joint gaze estimation and action recognition in FPV. Our method describes
the participant's gaze as a probabilistic variable and models its distribution
using stochastic units in a deep network. We further sample from these
stochastic units, generating an attention map to guide the aggregation of
visual features for action recognition. Our method is evaluated on our EGTEA
Gaze+ dataset and achieves a performance level that exceeds the
state-of-the-art by a significant margin. More importantly, we demonstrate that
our model can be applied to a larger-scale FPV dataset, EPIC-Kitchens, even
without using gaze, offering new state-of-the-art results on FPV action
recognition.
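
As a reading aid, below is a minimal PyTorch-style sketch of the mechanism the abstract describes, not the authors' implementation: gaze is treated as a probabilistic variable over spatial locations, a sample is drawn from stochastic units (here via a Gumbel-softmax relaxation, an illustrative assumption), and the sample serves as an attention map that guides the aggregation of visual features for action recognition. The backbone, feature shapes, and module names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticGazeAttention(nn.Module):
    """Gaze as a probabilistic variable: sample an attention map, then pool features."""

    def __init__(self, feat_channels: int, num_actions: int):
        super().__init__()
        # 1x1 conv predicts unnormalized gaze logits over the H x W feature grid.
        self.gaze_head = nn.Conv2d(feat_channels, 1, kernel_size=1)
        self.classifier = nn.Linear(feat_channels, num_actions)

    def forward(self, feats: torch.Tensor, tau: float = 1.0):
        # feats: (B, C, H, W) clip-level features from some video backbone (assumed).
        b, c, h, w = feats.shape
        gaze_logits = self.gaze_head(feats).view(b, h * w)
        # Stochastic unit: a differentiable sample from the gaze distribution,
        # drawn with the Gumbel-softmax relaxation (one possible choice).
        attn = F.gumbel_softmax(gaze_logits, tau=tau, hard=False)   # (B, H*W), sums to 1
        attn_map = attn.view(b, 1, h, w)
        # Attention-weighted aggregation of visual features for recognition.
        pooled = (feats * attn_map).sum(dim=(2, 3))                 # (B, C)
        return self.classifier(pooled), attn_map


# Toy usage with hypothetical shapes: 2 clips, 512-dim features on a 7x7 grid,
# and 106 action classes (the EGTEA Gaze+ label space).
feats = torch.randn(2, 512, 7, 7)
model = StochasticGazeAttention(feat_channels=512, num_actions=106)
action_logits, gaze_attention = model(feats)

In this reading of the abstract, the sampled attention map can be supervised with the gaze annotations in EGTEA Gaze+ when they are available, while the same attention mechanism can still be trained from the action loss alone, which is how the model transfers to EPIC-Kitchens "without using gaze."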
Related papers
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language
Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses visual grounding ability from existing models trained from image-text pairs and pure object detection data.
We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Open-Vocabulary Object Detection via Scene Graph Discovery [53.27673119360868]
Open-vocabulary (OV) object detection has attracted increasing research attention.
We propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection.
arXiv Detail & Related papers (2023-07-07T00:46:19Z)
- A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z)
- Visual Object Tracking in First Person Vision [33.62651949312872]
The study is made possible through the introduction of TREK-150, a novel benchmark dataset composed of 150 densely annotated video sequences.
Our results show that object tracking in FPV poses new challenges to current visual trackers.
arXiv Detail & Related papers (2022-09-27T16:18:47Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Is First Person Vision Challenging for Object Tracking? [32.64792520537041]
We present the first systematic study of object tracking in First Person Vision (FPV).
Our study extensively analyses the performance of recent visual trackers and baseline FPV trackers across different aspects, using a new performance measure.
Our results show that object tracking in FPV is challenging, which suggests that more research efforts should be devoted to this problem.
arXiv Detail & Related papers (2021-08-31T08:06:01Z)
- Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
arXiv Detail & Related papers (2021-04-15T17:59:32Z)
- Towards End-to-end Video-based Eye-Tracking [50.0630362419371]
Estimating eye-gaze from images alone is a challenging task due to un-observable person-specific factors.
We propose a novel dataset and accompanying method which aims to explicitly learn these semantic and temporal relationships.
We demonstrate that fusing information from visual stimuli and eye images can achieve performance comparable to figures reported in the literature.
arXiv Detail & Related papers (2020-07-26T12:39:15Z)