Actor-identified Spatiotemporal Action Detection -- Detecting Who Is
Doing What in Videos
- URL: http://arxiv.org/abs/2208.12940v1
- Date: Sat, 27 Aug 2022 06:51:12 GMT
- Title: Actor-identified Spatiotemporal Action Detection -- Detecting Who Is
Doing What in Videos
- Authors: Fan Yang, Norimichi Ukita, Sakriani Sakti, Satoshi Nakamura
- Abstract summary: Temporal Action Detection (TAD) has been investigated for estimating the start and end time for each action in videos.
Spatiotemporal Action Detection (SAD) has been studied for localizing the action both spatially and temporally in videos.
We propose a novel task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the gap between SAD and actor identification.
- Score: 29.5205455437899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of deep learning on video Action Recognition (AR) has motivated
researchers to progressively promote related tasks from the coarse level to the
fine-grained level. Compared with conventional AR that only predicts an action
label for the entire video, Temporal Action Detection (TAD) has been
investigated for estimating the start and end time for each action in videos.
Taking TAD a step further, Spatiotemporal Action Detection (SAD) has been
studied for localizing the action both spatially and temporally in videos.
However, who performs the action is generally ignored in SAD, while
identifying the actor could also be important. To this end, we propose a novel
task, Actor-identified Spatiotemporal Action Detection (ASAD), to bridge the
gap between SAD and actor identification.
In ASAD, we not only detect the spatiotemporal boundary for instance-level
action but also assign a unique ID to each actor. To approach ASAD, Multiple
Object Tracking (MOT) and Action Classification (AC) are two fundamental
elements. By using MOT, the spatiotemporal boundary of each actor is obtained
and assigned to a unique actor identity. By using AC, the action class is
estimated within the corresponding spatiotemporal boundary. Since ASAD is a new
task, it poses many new challenges that cannot be addressed by existing
methods: i) no dataset is specifically created for ASAD, ii) no evaluation
metrics are designed for ASAD, iii) current MOT performance is the bottleneck
to obtain satisfactory ASAD results. To address these problems, we i) annotate
a new ASAD dataset, ii) propose ASAD evaluation metrics that consider
multi-label actions and actor identification, and iii) improve the data
association strategies in MOT to boost tracking performance, which in turn
leads to better ASAD results. The code is available at
\url{https://github.com/fandulu/ASAD}.
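
As a concrete illustration of the decomposition described above (MOT supplies identity-aware spatiotemporal boundaries, AC labels each actor's tubelet), a minimal sketch is given below. It is a hypothetical outline, not the authors' implementation from the linked repository; `detect_and_track` and `classify_action` are placeholder stubs standing in for an off-the-shelf multi-object tracker and a multi-label action classifier.

```python
# Minimal, hypothetical sketch of an ASAD-style pipeline: track actors to get
# identity-aware spatiotemporal boundaries, then classify actions per tubelet.
# `detect_and_track` and `classify_action` are placeholders, not the paper's API.
from collections import defaultdict
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def detect_and_track(frame) -> List[Tuple[int, Box]]:
    """Placeholder MOT step: (actor_id, box) pairs for one frame."""
    raise NotImplementedError


def classify_action(crops: List) -> List[str]:
    """Placeholder AC step: multi-label action classes for one actor tubelet."""
    raise NotImplementedError


def asad_pipeline(frames: List) -> Dict[int, dict]:
    """Group tracked boxes into per-actor tubelets, then classify each tubelet."""
    tubelets: Dict[int, List[Tuple[int, Box]]] = defaultdict(list)
    for t, frame in enumerate(frames):
        for actor_id, box in detect_and_track(frame):
            tubelets[actor_id].append((t, box))

    results = {}
    for actor_id, track in tubelets.items():
        crops = [frames[t] for t, _ in track]  # in practice, crop frames to boxes
        results[actor_id] = {
            "start": track[0][0],               # temporal boundary from the track
            "end": track[-1][0],
            "boxes": track,                     # per-frame spatial boundary
            "actions": classify_action(crops),  # multi-label action estimate
        }
    return results
```

In such a decomposition, any tracking error propagates directly into the final result, which matches the abstract's observation that MOT performance is the bottleneck for ASAD.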
Related papers
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- JOADAA: joint online action detection and action anticipation [2.7792814152937027]
Action anticipation involves forecasting future actions by connecting past events to future ones.
Online action detection is the task of predicting actions in a streaming manner.
By combining action anticipation and online action detection, our approach can cover the missing dependencies of future information.
arXiv Detail & Related papers (2023-09-12T11:17:25Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z)
- Video Action Detection: Analysing Limitations and Challenges [70.01260415234127]
We analyze existing datasets on video action detection and discuss their limitations.
We perform a bias study that analyzes a key property differentiating videos from static images: the temporal aspect.
Such extreme experiments show the existence of biases that have managed to creep into existing methods in spite of careful modeling.
arXiv Detail & Related papers (2022-04-17T00:42:14Z)
- A Spatio-Temporal Identity Verification Method for Person-Action Instance Search in Movies [32.76347250146175]
Person-Action Instance Search (INS) aims to retrieve shots of a specific person carrying out a specific action from massive video shots.
Direct aggregation of two individual INS scores cannot guarantee the identity consistency between person and action.
We propose an identity consistency verification scheme to optimize the direct fusion score of person INS and action INS.
arXiv Detail & Related papers (2021-10-30T11:00:47Z)
- Towards High-Quality Temporal Action Detection with Sparse Proposals [14.923321325749196]
Temporal Action Detection aims to localize the temporal segments containing human action instances and predict the action categories.
We introduce Sparse Proposals to interact with the hierarchical features.
Experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds.
arXiv Detail & Related papers (2021-09-18T06:15:19Z)
- End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.