Spatio-Temporal Action Detection with Multi-Object Interaction
- URL: http://arxiv.org/abs/2004.00180v1
- Date: Wed, 1 Apr 2020 00:54:56 GMT
- Title: Spatio-Temporal Action Detection with Multi-Object Interaction
- Authors: Huijuan Xu, Lizhi Yang, Stan Sclaroff, Kate Saenko, Trevor Darrell
- Abstract summary: In this paper, we study the spatio-temporal action detection problem with multi-object interaction.
We introduce a new dataset that is spatially annotated with action tubes containing multi-object interactions.
We propose an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously.
- Score: 127.85524354900494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-temporal action detection in videos requires localizing the action
both spatially and temporally in the form of an "action tube". Nowadays, most
spatio-temporal action detection datasets (e.g. UCF101-24, AVA, DALY) are
annotated with action tubes that contain a single person performing the action,
thus the predominant action detection models simply employ a person detection
and tracking pipeline for localization. However, when the action is defined as
an interaction between multiple objects, such methods may fail since each
bounding box in the action tube contains multiple objects instead of one
person. In this paper, we study the spatio-temporal action detection problem
with multi-object interaction. We introduce a new dataset that is annotated
with action tubes containing multi-object interactions. Moreover, we propose an
end-to-end spatio-temporal action detection model that performs both spatial
and temporal regression simultaneously. Our spatial regression may enclose
multiple objects participating in the action. During test time, we simply
connect the regressed bounding boxes within the predicted temporal duration
using a simple heuristic. We report the baseline results of our proposed model
on this new dataset, and also show competitive results on the standard
benchmark UCF101-24 using only RGB input.
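The test-time linking step described above (connecting the regressed per-frame boxes within the predicted temporal duration using a simple heuristic) can be sketched as a greedy IoU-based matcher. This is a minimal illustration under the assumption that "simple heuristic" means greedy maximum-overlap linking; the function names and the exact strategy are not from the paper's code.

```python
# Hypothetical sketch of the tube-linking heuristic: greedily connect
# per-frame regressed boxes within the predicted temporal duration by
# maximum IoU overlap. Names are illustrative, not the paper's API.

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_tube(boxes_per_frame, start, end):
    """Link one box per frame into an action tube over frames [start, end].

    boxes_per_frame: list where boxes_per_frame[t] holds the candidate
    boxes regressed for frame t. Starting from the first candidate of the
    start frame, greedily pick the highest-IoU box in each later frame.
    """
    tube = [boxes_per_frame[start][0]]
    for t in range(start + 1, end + 1):
        best = max(boxes_per_frame[t], key=lambda b: iou(tube[-1], b))
        tube.append(best)
    return tube
```

Note that because the spatial regression may enclose multiple interacting objects in one box, this linking operates on whole interaction boxes rather than per-person detections.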
Related papers
- Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [59.87033229815062]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered.
Previous research employed interactive perception for manipulating articulated objects, but such open-loop approaches often overlook the interaction dynamics.
We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z) - STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking [13.269416985959404]
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision.
We propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT)
We use historical embedding features to model the representation of ReID and detection features in a sequential order.
Our framework sets a new state-of-the-art performance in MOTA and IDF1 metrics.
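One common way to model ReID and detection features with historical embeddings in sequential order, as the STCMOT summary describes, is a per-track exponential moving average of appearance features. This is an assumed illustration of the general idea, not STCMOT's actual mechanism.

```python
# Illustrative sketch only: propagate a track's historical ReID embedding
# sequentially by blending it with the current frame's feature (EMA).
# The momentum value and function name are hypothetical.
import numpy as np

def update_track_embedding(history, current, momentum=0.9):
    """Blend a track's historical embedding with the current frame's feature.

    history, current: 1-D L2-normalised feature vectors.
    Returns the re-normalised smoothed embedding, so that cosine
    similarity against future detections remains well-scaled.
    """
    blended = momentum * history + (1.0 - momentum) * current
    norm = np.linalg.norm(blended)
    return blended / norm if norm > 0 else blended
```

A high momentum keeps the track identity stable across UAV viewpoint changes, while the small current-frame contribution lets the embedding adapt as appearance drifts.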
arXiv Detail & Related papers (2024-09-17T14:34:18Z) - JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling [8.463489896549161]
Video action detection (VAD) is a formidable task that involves localizing and classifying actions within the spatial and temporal dimensions of a video clip.
We propose a two-stage VAD framework called Joint Actor-scene context Relation modeling (JARViS)
JARViS consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention.
arXiv Detail & Related papers (2024-08-07T08:08:08Z) - Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z) - STMixer: A One-Stage Sparse Action Detector [43.62159663367588]
We propose two core designs for a more flexible one-stage action detector.
First, we propose a query-based adaptive feature sampling module, which endows the detector with the flexibility to mine a group of features from the entire spatio-temporal domain of the video.
Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes features along the spatial and temporal dimensions, respectively, for better feature decoding.
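The decoupled mixing idea (treating the spatial and temporal dimensions separately rather than mixing them jointly) can be illustrated with two independent mixing matrices applied in sequence. This is a bare numpy sketch of the concept with placeholder matrices, not STMixer's learned implementation.

```python
# Minimal sketch of decoupled feature mixing: mix features across spatial
# tokens within each frame, then across frames for each token, using two
# separate mixing matrices. In practice these would be learned; here they
# are passed in as placeholders.
import numpy as np

def decoupled_mixing(feats, w_spatial, w_temporal):
    """feats: (T, S, C) array of T frames, S spatial tokens, C channels.

    w_spatial: (S, S) mixing matrix applied per frame across spatial tokens.
    w_temporal: (T, T) mixing matrix applied per token across frames.
    """
    # Spatial mixing: for each frame t, new_tokens = w_spatial @ tokens.
    mixed = np.einsum('ij,tjc->tic', w_spatial, feats)
    # Temporal mixing: for each token i, mix its features across frames.
    mixed = np.einsum('uv,vic->uic', w_temporal, mixed)
    return mixed
```

Because each matrix only sees one axis, the parameter count grows with S + T rather than S * T, which is the usual motivation for decoupling the two dimensions.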
arXiv Detail & Related papers (2024-04-15T14:52:02Z) - Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z) - Spatio-Temporal Context for Action Detection [2.294635424666456]
This work proposes to use non-aggregated temporal information.
The main contribution is the introduction of two cross attention blocks.
Experiments on the AVA dataset show the advantages of the proposed approach.
arXiv Detail & Related papers (2021-06-29T08:33:48Z) - Spatiotemporal Deformable Models for Long-Term Complex Activity Detection [23.880673582575856]
Long-term complex activity recognition can be crucial for autonomous systems such as cars and surgical robots.
Most current methods are designed to merely localise short-term action/activities or combinations of actions that only last for a few frames or seconds.
Our framework consists of three main building blocks: (i) action detection, (ii) the modelling of the deformable geometry of parts, and (iii) a sparsity mechanism.
arXiv Detail & Related papers (2021-04-16T16:05:34Z) - Unsupervised Domain Adaptation for Spatio-Temporal Action Localization [69.12982544509427]
Spatio-temporal action localization is an important problem in computer vision.
We propose an end-to-end unsupervised domain adaptation algorithm.
We show that significant performance gain can be achieved when spatial and temporal features are adapted separately or jointly.
arXiv Detail & Related papers (2020-10-19T04:25:10Z) - A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [74.00750936752418]
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC)
First, a spatial-temporal attention mechanism is presented to explore the most useful and important information.
Second, we build a joint feature sequence from sequential and instantaneous state information so that the generated trajectories maintain spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.