A Survey on Deep Learning-based Spatio-temporal Action Detection
- URL: http://arxiv.org/abs/2308.01618v1
- Date: Thu, 3 Aug 2023 08:48:14 GMT
- Title: A Survey on Deep Learning-based Spatio-temporal Action Detection
- Authors: Peng Wang, Fanwei Zeng, Yuntao Qian
- Abstract summary: STAD aims to classify the actions present in a video and localize them in space and time.
It has become a particularly active area of research in computer vision because of its explosively emerging real-world applications.
This paper provides a comprehensive review of the state-of-the-art deep learning-based methods for STAD.
- Score: 8.456482280676884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-temporal action detection (STAD) aims to classify the actions present
in a video and localize them in space and time. It has become a particularly
active area of research in computer vision because of its explosively emerging
real-world applications, such as autonomous driving, visual surveillance,
entertainment, etc. Many efforts have been devoted in recent years to building
a robust and effective framework for STAD. This paper provides a comprehensive
review of the state-of-the-art deep learning-based methods for STAD. Firstly, a
taxonomy is developed to organize these methods. Next, the linking algorithms,
which aim to associate the frame- or clip-level detection results together to
form action tubes, are reviewed. Then, the commonly used benchmark datasets and
evaluation metrics are introduced, and the performance of state-of-the-art
models is compared. Finally, the paper concludes with a discussion of potential
research directions for STAD.
Related papers
- Deep Learning for Video Anomaly Detection: A Review [52.74513211976795]
Video anomaly detection (VAD) aims to discover behaviors or events deviating from the normality in videos.
In the era of deep learning, a great variety of deep learning based methods are constantly emerging for the VAD task.
This review covers the spectrum of five different categories, namely, semi-supervised, weakly supervised, fully supervised, unsupervised and open-set supervised VAD.
arXiv Detail & Related papers (2024-09-09T07:31:16Z)
- Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization [3.996503381756227]
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations.
We propose a novel framework that aligns human action knowledge and semantic knowledge in a probabilistic embedding space.
Our method significantly outperforms all previous state-of-the-art methods.
arXiv Detail & Related papers (2024-08-12T07:09:12Z)
- Understanding active learning of molecular docking and its applications [0.6554326244334868]
We investigate how active learning methodologies effectively predict docking scores using only 2D structures.
Our findings suggest that surrogate models tend to memorize structural patterns prevalent in high docking scored compounds.
Our comprehensive analysis underscores the reliability and potential applicability of active learning methodologies in virtual screening campaigns.
arXiv Detail & Related papers (2024-06-14T05:43:42Z)
- Temporal Action Segmentation: An Analysis of Modern Techniques [43.725939095985915]
Temporal action segmentation (TAS) in videos aims at densely identifying video frames in minutes-long videos with multiple action classes.
Despite the rapid growth of TAS techniques in recent years, no systematic survey of this area has been conducted.
This survey analyzes and summarizes the most significant contributions and trends.
arXiv Detail & Related papers (2022-10-19T07:40:47Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Recent Few-Shot Object Detection Algorithms: A Survey with Performance Comparison [54.357707168883024]
Few-Shot Object Detection (FSOD) mimics the human ability of learning to learn.
FSOD transfers the learned generic object knowledge from the common heavy-tailed classes to the novel long-tailed object classes.
We give an overview of FSOD, including the problem definition, common datasets, and evaluation protocols.
arXiv Detail & Related papers (2022-03-27T04:11:28Z)
- Deep Learning Schema-based Event Extraction: Literature Review and Current Trends [60.29289298349322]
Event extraction technology based on deep learning has become a research hotspot.
This paper fills the gap by reviewing the state-of-the-art approaches, focusing on deep learning-based models.
arXiv Detail & Related papers (2021-07-05T16:32:45Z)
- Exploring Temporal Context and Human Movement Dynamics for Online Action Detection in Videos [32.88517041655816]
Temporal context and human movement dynamics can be effectively employed for online action detection.
Our approach uses various state-of-the-art architectures and appropriately combines the extracted features in order to improve action detection.
arXiv Detail & Related papers (2021-06-26T08:34:19Z)
- Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
- Joint Geographical and Temporal Modeling based on Matrix Factorization for Point-of-Interest Recommendation [6.346772579930929]
Point-of-Interest (POI) recommendation has become an important task, which learns the users' preferences and mobility patterns to recommend POIs.
Previous studies show that incorporating contextual information such as geographical and temporal influences is necessary to improve POI recommendation.
arXiv Detail & Related papers (2020-01-24T12:25:37Z)
- A Comprehensive Study on Temporal Modeling for Online Action Detection [50.558313106389335]
Online action detection (OAD) is a practical yet challenging task, which has attracted increasing attention in recent years.
This paper aims to provide a comprehensive study on temporal modeling for OAD including four meta types of temporal modeling methods.
We present several hybrid temporal modeling methods, which outperform the recent state-of-the-art methods with sizable margins on THUMOS-14 and TVSeries.
arXiv Detail & Related papers (2020-01-21T13:12:58Z)