ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction
Detection in Videos
- URL: http://arxiv.org/abs/2105.11731v1
- Date: Tue, 25 May 2021 07:54:35 GMT
- Title: ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction
Detection in Videos
- Authors: Meng-Jiun Chiou, Chun-Yu Liao, Li-Wei Wang, Roger Zimmermann and
Jiashi Feng
- Abstract summary: We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
- Score: 91.29436920371003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting human-object interactions (HOI) is an important step toward a
comprehensive visual understanding of machines. While detecting non-temporal
HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely
even for humans to guess temporal-related HOIs (e.g., opening/closing a door)
from a single video frame, where the neighboring frames play an essential role.
However, conventional HOI methods operating on only static images have been
used to predict temporal-related interactions, which is essentially guessing
without temporal contexts and may lead to sub-optimal performance. In this
paper, we bridge this gap by detecting video-based HOIs with explicit temporal
information. We first show that a naive temporal-aware variant of a common
action detection baseline does not work on video-based HOIs due to a
feature-inconsistency issue. We then propose a simple yet effective
architecture named Spatial-Temporal HOI Detection (ST-HOI) utilizing temporal
information such as human and object trajectories, correctly-localized visual
features, and spatial-temporal masking pose features. We construct a new video
HOI benchmark dubbed VidHOI where our proposed approach serves as a solid
baseline.
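
To make the trajectory-based idea above concrete, here is a minimal sketch (PyTorch-style Python) of pooling "correctly-localized" visual features along a tracked box; all names, shapes, and the mean-pooling choice are illustrative assumptions, not the authors' released code.

import torch
from torchvision.ops import roi_align

def trajectory_features(frame_feats, trajectory, output_size=7):
    # frame_feats: (T, C, H, W) per-frame feature maps from a 2D backbone.
    # trajectory:  (T, 4) tracked box (x1, y1, x2, y2) per frame, already in
    #              feature-map coordinates, for one detected human or object.
    pooled = []
    for t in range(frame_feats.shape[0]):
        # RoI-align against frame t's own feature map, so the crop stays
        # aligned with where the entity actually is at time t.
        box = torch.cat([torch.zeros(1), trajectory[t]]).unsqueeze(0)  # (1, 5): batch idx + box
        pooled.append(roi_align(frame_feats[t:t + 1], box, output_size))
    pooled = torch.cat(pooled)               # (T, C, 7, 7)
    return pooled.mean(dim=0).flatten()      # temporal average -> (C * 7 * 7,)

Pooling each frame's own feature map against that frame's box keeps the features aligned with the entity over time; a naive temporal-aware baseline that RoI-pools a single temporally collapsed 3D-CNN feature map with one box loses this alignment, which is one way to read the feature-inconsistency issue the abstract mentions. Per the abstract, the full model combines such trajectory features with spatial-temporal masking pose features before interaction classification.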
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection [30.896749712316222]
This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a spatial-temporal graph with human and object nodes as input.
We achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
arXiv Detail & Related papers (2022-06-07T07:26:06Z)
- STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a spatiotemporal-aware unit (STAU) for video prediction and beyond.
Our STAU outperforms other methods on all tasks in terms of both performance and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatial-temporal relationship between humans and objects is the key cue for understanding the contextual information presented in the video.
With effective spatial-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
- LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos [13.25502885135043]
Analyzing the interactions between humans and objects in a video requires identifying the relationships between the humans and the objects present in it.
We present a hierarchical approach, LIGHTEN, to learn visual features that effectively capture spatio-temporal cues at multiple granularities in a video.
We achieve state-of-the-art results on the human-object interaction detection and anticipation tasks of CAD-120 (88.9% and 92.6%), and competitive results on image-based HOI detection in V-COCO.
arXiv Detail & Related papers (2020-12-17T05:44:07Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance over state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)