Spatial-Temporal Alignment Network for Action Recognition and Detection
- URL: http://arxiv.org/abs/2012.02426v1
- Date: Fri, 4 Dec 2020 06:23:40 GMT
- Title: Spatial-Temporal Alignment Network for Action Recognition and Detection
- Authors: Junwei Liang, Liangliang Cao, Xuehan Xiong, Ting Yu, Alexander
Hauptmann
- Abstract summary: This paper studies how to introduce viewpoint-invariant feature representations that can help action recognition and detection.
We propose a novel Spatial-Temporal Alignment Network (STAN) that aims to learn geometric invariant representations for action recognition and action detection.
We test our STAN model extensively on AVA, Kinetics-400, AVA-Kinetics, Charades, and Charades-Ego datasets.
- Score: 80.19235282200697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies how to introduce viewpoint-invariant feature
representations that can help action recognition and detection. Although we
have witnessed great progress in action recognition over the past decade,
efficiently modeling the geometric variations in large-scale datasets remains a
challenging yet interesting problem. This paper proposes a novel
Spatial-Temporal Alignment Network (STAN) that aims to learn geometrically
invariant representations for action recognition and action detection. The STAN
model is lightweight and generic, and can be plugged into existing
action recognition models such as ResNet3D and SlowFast at very low extra
computational cost. We test our STAN model extensively on the AVA, Kinetics-400,
AVA-Kinetics, Charades, and Charades-Ego datasets. The experimental results
show that the STAN model consistently improves the state of the art in both
action detection and action recognition tasks. We will release our data, models,
and code.
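Since the abstract describes STAN only at a high level, here is a minimal sketch of a spatial-transformer-style alignment block consistent with the "lightweight plug-in" claim. The module name `AlignmentBlock`, the pooled-linear localization head, and the per-frame 2D affine warp are illustrative assumptions, not the authors' released architecture.

```python
# Minimal sketch of a STAN-style alignment block (an assumption, not the
# authors' released code): a tiny regressor predicts one affine transform
# per clip, which is applied frame-by-frame to warp intermediate features
# toward a canonical viewpoint before the backbone continues.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Lightweight localization head: global pooling plus one linear
        # layer keeps the extra compute low, matching the plug-in claim.
        self.loc = nn.Linear(channels, 6)
        # Start at the identity transform so training begins from the
        # unaligned baseline.
        nn.init.zeros_(self.loc.weight)
        self.loc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        # x: (N, C, T, H, W) feature map from a 3D backbone stage.
        n, c, t, h, w = x.shape
        theta = self.loc(x.mean(dim=(2, 3, 4))).view(n, 2, 3)
        frames = x.transpose(1, 2).reshape(n * t, c, h, w)
        theta = theta.repeat_interleave(t, dim=0)  # same warp for every frame
        grid = F.affine_grid(theta, list(frames.shape), align_corners=False)
        aligned = F.grid_sample(frames, grid, align_corners=False)
        return aligned.reshape(n, t, c, h, w).transpose(1, 2)
```

Under these assumptions, such a block could be inserted after an early stage of ResNet3D or a SlowFast pathway; because the localization head is a single pooled linear layer, its parameter and FLOP overhead would be negligible relative to the backbone.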
Related papers
- SOAR: Advancements in Small Body Object Detection for Aerial Imagery Using State Space Models and Programmable Gradients [0.8873228457453465]
Small object detection in aerial imagery presents significant challenges in computer vision.
Traditional methods using transformer-based models often face limitations stemming from the lack of specialized databases.
This paper introduces two innovative approaches that significantly enhance detection and segmentation capabilities for small aerial objects.
arXiv Detail & Related papers (2024-05-02T19:47:08Z)
- Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection [55.2480439325792]
We present an in-depth evaluation of an object detection model that integrates the LSKNet backbone with the DiffusionDet head.
The proposed model achieves a mean average precision (mAP) of approximately 45.7%, a significant improvement.
This advancement underscores the effectiveness of the proposed modifications and sets a new benchmark in aerial image analysis.
arXiv Detail & Related papers (2023-11-21T19:49:13Z)
- Spatial-Temporal Alignment Network for Action Recognition [5.2170672727035345]
This paper studies introducing viewpoint-invariant feature representations into existing action recognition architectures.
We propose a novel Spatial-Temporal Alignment Network (STAN), which explicitly learns geometrically invariant representations for action recognition.
We test our STAN model on widely used datasets such as UCF101 and HMDB51.
arXiv Detail & Related papers (2023-08-19T03:31:57Z)
- Texture-Based Input Feature Selection for Action Recognition [3.9596068699962323]
We propose a novel method to determine the task-irrelevant content in inputs that increases the domain discrepancy.
We show that our proposed model is superior to existing models for action recognition on the HMDB-51 dataset and the Penn Action dataset.
arXiv Detail & Related papers (2023-02-28T23:56:31Z)
- Baby Physical Safety Monitoring in Smart Home Using Action Recognition System [0.0]
We present a novel framework combining transfer learning techniques with a Conv2D LSTM layer to extract features from the pre-trained I3D model on the Kinetics dataset.
We developed a benchmark dataset and an automated model that uses LSTM convolution with I3D (ConvLSTM-I3D) for recognizing and predicting baby activities in a smart baby room.
arXiv Detail & Related papers (2022-10-22T19:00:14Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object detection and classification models, which take image or patch features as input, LocATe's transformer model captures long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation [135.10594078615952]
We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects.
The accompanying benchmark contains over 17,000 action trajectories with six types of plush toys and 78 variants.
Our model achieves the best performance in geometry, correspondence, and dynamics predictions.
arXiv Detail & Related papers (2022-03-14T04:56:55Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, and attains high speed in training and inference (see the sketch of segmented linear attention after this list).
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to match the accuracy of models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one (see the sketch after this list).
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- DeepActsNet: Spatial and Motion features from Face, Hands, and Body Combined with Convolutional and Graph Networks for Improved Action Recognition [10.690794159983199]
We present "Deep Action Stamps (DeepActs)", a novel data representation to encode actions from video sequences.
We also present "DeepActsNet", a deep-learning-based ensemble model that learns convolutional and structural features from Deep Action Stamps for highly accurate action recognition.
arXiv Detail & Related papers (2020-09-21T12:41:56Z)
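For the STAR entry above, here is a hedged sketch of what "segmented linear attention on the temporal dimension" could look like: kernelized linear attention (with the common elu(x)+1 feature map) computed independently within fixed-length temporal segments. The function name, segmenting scheme, and kernel choice are our assumptions, not STAR's published formulation.

```python
# Hedged sketch of segmented linear attention over the temporal dimension
# (assumed from the abstract; STAR's exact formulation may differ).
import torch
import torch.nn.functional as F

def segmented_linear_attention(q, k, v, seg_len):
    # q, k, v: (batch, time, dim); attention is restricted to
    # non-overlapping temporal segments of length seg_len.
    b, t, d = q.shape
    pad = (-t) % seg_len                      # pad time so it divides evenly
    q, k, v = (F.pad(x, (0, 0, 0, pad)) for x in (q, k, v))
    s = (t + pad) // seg_len
    # Feature map phi(x) = elu(x) + 1 makes attention linear in seg_len.
    q = (F.elu(q) + 1).reshape(b, s, seg_len, d)
    k = (F.elu(k) + 1).reshape(b, s, seg_len, d)
    v = v.reshape(b, s, seg_len, d)
    kv = torch.einsum('bsld,bsle->bsde', k, v)             # per-segment K^T V
    norm = torch.einsum('bsld,bsd->bsl', q, k.sum(dim=2))  # normalizer Q(K^T 1)
    out = torch.einsum('bsld,bsde->bsle', q, kv) / (norm.unsqueeze(-1) + 1e-6)
    return out.reshape(b, s * seg_len, d)[:, :t]           # drop padding
```

Because the softmax is replaced by a kernel feature map, each segment costs O(seg_len * d^2) rather than O(seg_len^2 * d), which is one plausible source of the parameter and speed efficiency the summary reports.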
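Similarly, for the QAMem entry, the sketch below shows one possible reading of "separate memory values for each query": every landmark query owns its own value projection over the low-resolution feature map instead of sharing a single one. The class name and shapes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of per-query memory values in the spirit of QAMem
# (our assumption from the abstract, not the paper's actual code).
import torch
import torch.nn as nn

class PerQueryMemory(nn.Module):
    def __init__(self, num_queries, dim):
        super().__init__()
        # One value projection per query instead of a single shared one,
        # so each landmark query reads its own "memory" of the feature map.
        self.value_proj = nn.Parameter(
            torch.randn(num_queries, dim, dim) * dim ** -0.5)

    def forward(self, queries, features):
        # queries: (batch, num_queries, dim); features: (batch, hw, dim)
        attn = torch.softmax(queries @ features.transpose(1, 2), dim=-1)   # (b, q, hw)
        values = torch.einsum('bld,qde->bqle', features, self.value_proj)  # per-query values
        return torch.einsum('bql,bqle->bqe', attn, values)                 # (b, q, dim)
```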
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.