Action Transformer: A Self-Attention Model for Short-Time Human Action
  Recognition
- URL: http://arxiv.org/abs/2107.00606v2
- Date: Fri, 2 Jul 2021 09:33:48 GMT
- Title: Action Transformer: A Self-Attention Model for Short-Time Human Action
  Recognition
- Authors: Vittorio Mazzia, Simone Angarano, Francesco Salvetti, Federico
  Angelini and Marcello Chiaberge
- Abstract summary: Action Transformer (AcT) is a self-attentional architecture that consistently outperforms more elaborated networks that mix convolutional, recurrent, and attentive layers.
AcT exploits 2D pose representations over small temporal windows, providing a low latency solution for accurate and effective real-time performance.
- Score: 5.123810256000945
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract:   Deep neural networks based purely on attention have been successful across
several domains, relying on minimal architectural priors from the designer. In
Human Action Recognition (HAR), attention mechanisms have been primarily
adopted on top of standard convolutional or recurrent layers, improving the
overall generalization capability. In this work, we introduce Action
Transformer (AcT), a simple, fully self-attentional architecture that
consistently outperforms more elaborated networks that mix convolutional,
recurrent, and attentive layers. In order to limit computational and energy
requests, building on previous human action recognition research, the proposed
approach exploits 2D pose representations over small temporal windows,
providing a low latency solution for accurate and effective real-time
performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as
an attempt to build a formal training and evaluation benchmark for real-time
short-time human action recognition. Extensive experimentation on MPOSE2021
with our proposed methodology and several previous architectural solutions
proves the effectiveness of the AcT model and poses the base for future work on
HAR.
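
To make the core idea in the abstract concrete, here is a minimal sketch of single-head scaled dot-product self-attention applied along the time axis of a short window of 2D poses. It is not the authors' implementation: the window length (30 frames), keypoint count (13), embedding width (16), random weights, and the mean-pooling step are all assumptions chosen for illustration, and the sketch leaves out pieces a full Transformer encoder would have (multiple heads, positional information, feed-forward layers, normalization).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over frames.

    x holds T rows (one per frame); each row is a flat list of
    2 * K keypoint coordinates (x1, y1, ..., xK, yK).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)               # T x T frame-to-frame scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Small random projection matrix (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, K, D = 30, 13, 16  # assumed: frames per window, keypoints, embedding width

# Synthetic pose window: T frames of K (x, y) keypoints in [0, 1).
poses = [[rng.random() for _ in range(2 * K)] for _ in range(T)]
w_q, w_k, w_v = (random_matrix(2 * K, D, rng) for _ in range(3))

out, attn = self_attention(poses, w_q, w_k, w_v)

# Assumed pooling: average over time to get one feature vector per clip.
clip_feature = [sum(col) / T for col in zip(*out)]
```

Each row of `attn` shows how strongly one frame attends to every other frame in the window. Because the input is a handful of keypoint coordinates per frame rather than raw pixels, the attention matrix is only T x T, which is consistent with the low-latency motivation stated in the abstract.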
 
      
Related papers
        - Less is More: Efficient Weight Farcasting with 1-Layer Neural Network [18.765677644342098]
 We introduce a novel framework which diverges from conventional approaches by leveraging long-term time series forecasting techniques.
Our method capitalizes solely on initial and final weight values, offering a streamlined alternative for complex model architectures.
Empirical evaluations conducted on synthetic weight sequences and real-world deep learning architectures, including the prominent large language model DistilBERT, demonstrate the superiority of our method.
 arXiv  Detail & Related papers  (2025-05-05T15:10:20Z)
- Learning Transformer-based World Models with Contrastive Predictive Coding [58.0159270859475]
 We show that the next state prediction objective is insufficient to fully exploit the representation capabilities of Transformers.
We propose to extend world model predictions to longer time horizons by introducing TWISTER, a world model using action-conditioned Contrastive Predictive Coding.
TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.
 arXiv  Detail & Related papers  (2025-03-06T13:18:37Z)
- VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by
  Visual-Semantic Fusion for Egocentric Action Anticipation [33.41226268323332]
 Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions in the first-person view.
Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network.
We propose a novel visual-semantic fusion enhanced and Transformer GRU-based action anticipation framework.
 arXiv  Detail & Related papers  (2023-07-08T06:49:54Z)
- Surrogate-assisted Multi-objective Neural Architecture Search for
  Real-time Semantic Segmentation [11.866947846619064]
 Neural architecture search (NAS) has emerged as a promising avenue toward automating the design of architectures.
We propose a surrogate-assisted multi-objective method to address the challenges of applying NAS to semantic segmentation.
Our method can identify architectures significantly outperforming existing state-of-the-art architectures designed both manually by human experts and automatically by other NAS methods.
 arXiv  Detail & Related papers  (2022-08-14T10:18:51Z)
- Human Activity Recognition Using Cascaded Dual Attention CNN and
  Bi-Directional GRU Framework [3.3721926640077795]
 Vision-based human activity recognition has emerged as one of the essential research areas in video analytics domain.
This paper presents a computationally efficient yet generic spatial-temporal cascaded framework that exploits the deep discriminative spatial and temporal features for human activity recognition.
The proposed framework attains an improvement in execution time up to 167 times in terms of frames per second as compared to most of the contemporary action recognition methods.
 arXiv  Detail & Related papers  (2022-08-09T20:34:42Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
  Action Recognition [88.34182299496074]
 Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
 arXiv  Detail & Related papers  (2022-07-17T07:05:39Z)
- Transformer Inertial Poser: Attention-based Real-time Human Motion
  Reconstruction from Sparse IMUs [79.72586714047199]
 We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real-time.
Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
 arXiv  Detail & Related papers  (2022-03-29T16:24:52Z)
- ProFormer: Learning Data-efficient Representations of Body Movement with
  Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
 Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
 arXiv  Detail & Related papers  (2022-02-23T11:11:54Z)
- Dynamic Iterative Refinement for Efficient 3D Hand Pose Estimation [87.54604263202941]
 We propose a tiny deep neural network of which partial layers are iteratively exploited for refining its previous estimations.
We employ learned gating criteria to decide whether to exit from the weight-sharing loop, allowing per-sample adaptation in our model.
Our method consistently outperforms state-of-the-art 2D/3D hand pose estimation approaches in terms of both accuracy and efficiency for widely used benchmarks.
 arXiv  Detail & Related papers  (2021-11-11T23:31:34Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
 This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
 Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
 arXiv  Detail & Related papers  (2021-07-15T02:53:11Z)
- Joint Learning of Neural Transfer and Architecture Adaptation for Image
  Recognition [77.95361323613147]
 Current state-of-the-art visual recognition systems rely on pretraining a neural network on a large-scale dataset and finetuning the network weights on a smaller dataset.
In this work, we prove that dynamically adapting network architectures tailored for each domain task along with weight finetuning benefits in both efficiency and effectiveness.
Our method can be easily generalized to an unsupervised paradigm by replacing supernet training with self-supervised learning in the source domain tasks and performing linear evaluation in the downstream tasks.
 arXiv  Detail & Related papers  (2021-03-31T08:15:17Z)
- Learning Long-term Visual Dynamics with Region Proposal Interaction
  Networks [75.06423516419862]
 We build object representations that can capture inter-object and object-environment interactions over a long-range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
 arXiv  Detail & Related papers  (2020-08-05T17:48:00Z)
- Attention-Based Deep Learning Framework for Human Activity Recognition
  with User Adaptation [5.629161809575013]
 Sensor-based human activity recognition (HAR) requires predicting the action of a person based on sensor-generated time series data.
We propose a novel deep learning framework, algname, based on a purely attention-based mechanism.
We show that our proposed attention-based architecture is considerably more powerful than previous approaches.
 arXiv  Detail & Related papers  (2020-06-06T09:26:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     