Action Transformer: A Self-Attention Model for Short-Time Human Action
Recognition
- URL: http://arxiv.org/abs/2107.00606v2
- Date: Fri, 2 Jul 2021 09:33:48 GMT
- Title: Action Transformer: A Self-Attention Model for Short-Time Human Action
Recognition
- Authors: Vittorio Mazzia, Simone Angarano, Francesco Salvetti, Federico
Angelini and Marcello Chiaberge
- Abstract summary: Action Transformer (AcT) is a self-attentional architecture that consistently outperforms more elaborate networks that mix convolutional, recurrent, and attentive layers.
AcT exploits 2D pose representations over small temporal windows, providing a low-latency solution for accurate and effective real-time performance.
- Score: 5.123810256000945
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep neural networks based purely on attention have been successful across
several domains, relying on minimal architectural priors from the designer. In
Human Action Recognition (HAR), attention mechanisms have been primarily
adopted on top of standard convolutional or recurrent layers, improving the
overall generalization capability. In this work, we introduce Action
Transformer (AcT), a simple, fully self-attentional architecture that
consistently outperforms more elaborate networks that mix convolutional,
recurrent, and attentive layers. To limit computational and energy demands,
and building on previous human action recognition research, the proposed
approach exploits 2D pose representations over small temporal windows,
providing a low-latency solution for accurate and effective real-time
performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as
an attempt to build a formal training and evaluation benchmark for real-time
short-time human action recognition. Extensive experimentation on MPOSE2021
with our proposed methodology and several previous architectural solutions
proves the effectiveness of the AcT model and lays the foundation for future
work on HAR.
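No implementation details beyond the abstract are given on this page. As a rough illustration of the idea it describes, a fully self-attentional classifier operating on short temporal windows of 2D pose keypoints, the sketch below uses PyTorch; the window length, keypoint count, and all hyperparameters are illustrative assumptions, not the authors' configuration or released code.

```python
# Minimal sketch of a fully self-attentional classifier over 2D pose windows.
# Assumptions (not the authors' settings): 30-frame windows, 17 keypoints with
# (x, y) coordinates, a small transformer encoder, and a class-token readout.
import torch
import torch.nn as nn


class PoseTransformerClassifier(nn.Module):
    def __init__(self, num_frames=30, num_keypoints=17, d_model=64,
                 n_heads=4, n_layers=4, num_classes=20):
        super().__init__()
        in_dim = num_keypoints * 2                       # flattened (x, y) per frame
        self.embed = nn.Linear(in_dim, d_model)          # one token per frame
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_frames + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, poses):                            # poses: (B, T, K, 2)
        b, t, k, c = poses.shape
        x = self.embed(poses.reshape(b, t, k * c))       # (B, T, d_model)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_emb    # prepend class token, add positions
        x = self.encoder(x)                              # pure self-attention over the window
        return self.head(x[:, 0])                        # classify from the class token


# Usage: a batch of 8 pose windows -> per-class logits
logits = PoseTransformerClassifier()(torch.randn(8, 30, 17, 2))
print(logits.shape)  # torch.Size([8, 20])
```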
Related papers
- VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by
Visual-Semantic Fusion for Egocentric Action Anticipation [33.41226268323332]
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions in the first-person view.
Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network.
We propose a novel action anticipation framework based on a Transformer-GRU architecture and enhanced by visual-semantic fusion.
arXiv Detail & Related papers (2023-07-08T06:49:54Z) - Surrogate-assisted Multi-objective Neural Architecture Search for
Real-time Semantic Segmentation [11.866947846619064]
Neural architecture search (NAS) has emerged as a promising avenue toward automating the design of architectures.
We propose a surrogate-assisted multi-objective method to address the challenges of applying NAS to semantic segmentation.
Our method can identify architectures significantly outperforming existing state-of-the-art architectures designed both manually by human experts and automatically by other NAS methods.
arXiv Detail & Related papers (2022-08-14T10:18:51Z) - Human Activity Recognition Using Cascaded Dual Attention CNN and
Bi-Directional GRU Framework [3.3721926640077795]
Vision-based human activity recognition has emerged as one of the essential research areas in the video analytics domain.
This paper presents a computationally efficient yet generic spatial-temporal cascaded framework that exploits deep discriminative spatial and temporal features for human activity recognition.
The proposed framework achieves up to a 167x improvement in execution speed, measured in frames per second, compared to most contemporary action recognition methods.
arXiv Detail & Related papers (2022-08-09T20:34:42Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are available only on a source dataset but unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - Transformer Inertial Poser: Attention-based Real-time Human Motion
Reconstruction from Sparse IMUs [79.72586714047199]
We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real-time.
Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
arXiv Detail & Related papers (2022-03-29T16:24:52Z) - ProFormer: Learning Data-efficient Representations of Body Movement with
Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
arXiv Detail & Related papers (2022-02-23T11:11:54Z) - Dynamic Iterative Refinement for Efficient 3D Hand Pose Estimation [87.54604263202941]
We propose a tiny deep neural network whose partial layers are iteratively reused to refine its previous estimates.
We employ learned gating criteria to decide whether to exit from the weight-sharing loop, allowing per-sample adaptation in our model.
Our method consistently outperforms state-of-the-art 2D/3D hand pose estimation approaches in terms of both accuracy and efficiency on widely used benchmarks.
arXiv Detail & Related papers (2021-11-11T23:31:34Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and attains high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - Joint Learning of Neural Transfer and Architecture Adaptation for Image
Recognition [77.95361323613147]
Current state-of-the-art visual recognition systems rely on pretraining a neural network on a large-scale dataset and finetuning the network weights on a smaller dataset.
In this work, we prove that dynamically adapting network architectures tailored to each domain task, together with weight finetuning, benefits both efficiency and effectiveness.
Our method can be easily generalized to an unsupervised paradigm by replacing supernet training with self-supervised learning in the source domain tasks and performing linear evaluation in the downstream tasks.
arXiv Detail & Related papers (2021-03-31T08:15:17Z) - Learning Long-term Visual Dynamics with Region Proposal Interaction
Networks [75.06423516419862]
We build object representations that can capture inter-object and object-environment interactions over a long range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
arXiv Detail & Related papers (2020-08-05T17:48:00Z) - Attention-Based Deep Learning Framework for Human Activity Recognition
with User Adaptation [5.629161809575013]
Sensor-based human activity recognition (HAR) requires predicting a person's action from sensor-generated time series data.
We propose a novel deep learning framework based on a purely attention-based mechanism.
We show that our proposed attention-based architecture is considerably more powerful than previous approaches.
arXiv Detail & Related papers (2020-06-06T09:26:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.