Human-Object Interaction Prediction in Videos through Gaze Following
- URL: http://arxiv.org/abs/2306.03597v1
- Date: Tue, 6 Jun 2023 11:36:14 GMT
- Title: Human-Object Interaction Prediction in Videos through Gaze Following
- Authors: Zhifan Ni, Esteve Valls Mascaró, Hyemin Ahn, Dongheui Lee
- Abstract summary: We design a framework to detect current HOIs and anticipate future HOIs in videos.
We propose to leverage human gaze information since people often fixate on an object before interacting with it.
Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life.
- Score: 9.61701724661823
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding the human-object interactions (HOIs) from a video is essential
to fully comprehend a visual scene. This line of research has been addressed by
detecting HOIs from images and lately from videos. However, the video-based HOI
anticipation task in the third-person view remains understudied. In this paper,
we design a framework to detect current HOIs and anticipate future HOIs in
videos. We propose to leverage human gaze information since people often fixate
on an object before interacting with it. These gaze features together with the
scene contexts and the visual appearances of human-object pairs are fused
through a spatio-temporal transformer. To evaluate the model in the HOI
anticipation task in a multi-person scenario, we propose a set of person-wise
multi-label metrics. Our model is trained and validated on the VidHOI dataset,
which contains videos capturing daily life and is currently the largest video
HOI dataset. Experimental results in the HOI detection task show that our
approach outperforms the baseline by a large relative margin of 36.3%. Moreover,
we conduct an extensive ablation study to demonstrate the effectiveness of our
modifications and extensions to the spatio-temporal transformer. Our code is
publicly available at https://github.com/nizhf/hoi-prediction-gaze-transformer.
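As a rough illustration of the fusion described in the abstract, the sketch below concatenates per-pair appearance, scene-context, and gaze features, then applies a spatial encoder over the human-object pairs in each frame followed by a temporal encoder over frames. All module names, feature dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation (which lives in the linked repository).

```python
# Minimal sketch of fusing appearance, scene-context, and gaze features with a
# spatio-temporal transformer. Dimensions, names, and the fusion-by-concatenation
# choice are assumptions for illustration only.
import torch
import torch.nn as nn

class SpatioTemporalHOIFusion(nn.Module):
    def __init__(self, d_appear=1024, d_scene=512, d_gaze=256, d_model=512,
                 n_heads=8, n_layers=2, n_interactions=50):
        super().__init__()
        # Project the concatenated per-frame, per-pair features to a common width.
        self.proj = nn.Linear(d_appear + d_scene + d_gaze, d_model)
        spatial_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Spatial encoder attends over human-object pairs within a frame;
        # temporal encoder attends over frames for each pair.
        self.spatial_enc = nn.TransformerEncoder(spatial_layer, n_layers)
        self.temporal_enc = nn.TransformerEncoder(temporal_layer, n_layers)
        self.classifier = nn.Linear(d_model, n_interactions)  # multi-label logits

    def forward(self, appear, scene, gaze):
        # appear/scene/gaze: (batch, frames, pairs, feature_dim)
        b, t, p, _ = appear.shape
        x = self.proj(torch.cat([appear, scene, gaze], dim=-1))   # (b, t, p, d)
        x = self.spatial_enc(x.reshape(b * t, p, -1)).reshape(b, t, p, -1)
        x = x.permute(0, 2, 1, 3).reshape(b * p, t, -1)           # pairs as batch
        x = self.temporal_enc(x)
        # Use the last observed frame's token to detect current / anticipate future HOIs.
        return self.classifier(x[:, -1].reshape(b, p, -1))        # (b, p, n_interactions)

# Example: 2 videos, 5 frames, 3 human-object pairs.
model = SpatioTemporalHOIFusion()
logits = model(torch.randn(2, 5, 3, 1024), torch.randn(2, 5, 3, 512),
               torch.randn(2, 5, 3, 256))
print(logits.shape)  # torch.Size([2, 3, 50])
```

Feeding the last-frame tokens into multi-label heads is one plausible way to realize the detection and anticipation outputs; the repository above contains the authors' actual design.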
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Interaction Region Visual Transformer for Egocentric Action Anticipation [18.873728614415946]
We propose a novel way to represent human-object interactions for egocentric action anticipation.
We model interactions between hands and objects using Spatial Cross-Attention.
We then infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens.
Using these tokens, we construct an interaction-centric video representation for action anticipation.
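The Spatial Cross-Attention mentioned above is, at its core, standard cross-attention with hand tokens as queries and object tokens as keys and values. The sketch below is a generic illustration under that assumption; the token counts, dimensions, and variable names are made up and not taken from the paper.

```python
# Generic hand-to-object cross-attention (an illustrative assumption, not the
# paper's implementation): hand tokens query object tokens to produce
# interaction-aware hand features.
import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

hand_tokens = torch.randn(1, 2, d_model)    # e.g. left- and right-hand features
object_tokens = torch.randn(1, 6, d_model)  # features of detected objects in the frame

interaction_tokens, attn_weights = cross_attn(
    query=hand_tokens, key=object_tokens, value=object_tokens)
print(interaction_tokens.shape)  # torch.Size([1, 2, 256])
```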
arXiv Detail & Related papers (2022-11-25T15:00:51Z)
- Estimation of Appearance and Occupancy Information in Birds Eye View from Surround Monocular Images [2.69840007334476]
Bird's-eye view (BEV) expresses the location of different traffic participants in the ego vehicle frame from a top-down view.
We propose a novel representation that captures the appearance and occupancy information of various traffic participants from an array of monocular cameras covering a 360 deg field of view (FOV).
We use a learned image embedding of all camera images to generate a BEV of the scene at any instant that captures both appearance and occupancy of the scene.
arXiv Detail & Related papers (2022-11-08T20:57:56Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos [13.25502885135043]
Analyzing human-object interactions in a video requires identifying the relationships between the humans and the objects present in it.
We present a hierarchical approach, LIGHTEN, to learn visual features that effectively capture spatio-temporal cues at multiple granularities in a video.
We achieve state-of-the-art results on the human-object interaction detection (88.9% and 92.6%) and anticipation tasks of CAD-120, and competitive results on image-based HOI detection on V-COCO.
arXiv Detail & Related papers (2020-12-17T05:44:07Z)