STAR: Sparse Transformer-based Action Recognition
- URL: http://arxiv.org/abs/2107.07089v1
- Date: Thu, 15 Jul 2021 02:53:11 GMT
- Title: STAR: Sparse Transformer-based Action Recognition
- Authors: Feng Shi, Chonghan Lee, Liang Qiu, Yizhou Zhao, Tianyi Shen, Shivran
Muralidhar, Tian Han, Song-Chun Zhu, Vijaykrishnan Narayanan
- Abstract summary: This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and running faster in training and inference.
- Score: 61.490243467748314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The cognitive system for human action and behavior has evolved into a deep
learning regime, and especially the advent of Graph Convolution Networks has
transformed the field in recent years. However, previous works have mainly
focused on over-parameterized and complex models based on dense graph
convolution networks, resulting in low efficiency in training and inference.
Meanwhile, the Transformer architecture-based model has not yet been well
explored for cognitive application in human action and behavior estimation.
This work proposes a novel skeleton-based human action recognition model with
sparse attention on the spatial dimension and segmented linear attention on the
temporal dimension of data. Our model can also process variable-length video
clips grouped as a single batch. Experiments show that our model achieves
comparable performance while using far fewer trainable parameters and running
faster in training and inference: a 4~18x speedup and 1/7~1/15 the model size
of the baseline models at competitive accuracy.
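As a rough illustration of what "sparse attention on the spatial dimension" can mean, the sketch below restricts scaled dot-product attention so that each skeleton joint attends only to a fixed set of neighboring joints. This is not the authors' implementation: the function name `sparse_joint_attention`, the three-joint chain, and the use of raw features as queries, keys, and values are all assumptions made for the example, and the segmented linear attention and variable-length batching mentioned in the abstract are not covered.

```python
import math


def sparse_joint_attention(features, neighbors):
    """Scaled dot-product attention restricted to skeleton neighbors.

    features:  list of per-joint feature vectors (used as query, key, and value).
    neighbors: dict mapping a joint index to the joint indices it may attend to.
    Hypothetical helper for illustration only.
    """
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        allowed = neighbors[i]
        # Scores are computed only for allowed pairs; other pairs are skipped.
        scores = [
            sum(q * k for q, k in zip(query, features[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed joints.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the neighbors' feature vectors.
        outputs.append([
            sum(w * features[j][d] for w, j in zip(weights, allowed))
            for d in range(dim)
        ])
    return outputs


# Hypothetical 3-joint chain 0 - 1 - 2, with self-loops.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = sparse_joint_attention(features, neighbors)
```

Because scores are only evaluated for listed neighbor pairs, the cost grows with the number of skeleton edges rather than with the square of the number of joints, which is the general motivation for sparse attention.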
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Texture-Based Input Feature Selection for Action Recognition [3.9596068699962323]
We propose a novel method to determine the task-irrelevant content in inputs which increases the domain discrepancy.
We show that our proposed model is superior to existing models for action recognition on the HMDB-51 dataset and the Penn Action dataset.
arXiv Detail & Related papers (2023-02-28T23:56:31Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Graph-based Normalizing Flow for Human Motion Generation and Reconstruction [20.454140530081183]
We propose a probabilistic generative model to synthesize and reconstruct long horizon motion sequences conditioned on past information and control signals.
We evaluate the models on a mixture of motion capture datasets of human locomotion with foot-step and bone-length analysis.
arXiv Detail & Related papers (2021-04-07T09:51:15Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- A Compact Deep Architecture for Real-time Saliency Prediction [42.58396452892243]
Saliency models aim to imitate the attention mechanism in the human visual system.
Deep models have a high number of parameters which makes them less suitable for real-time applications.
Here we propose a compact yet fast model for real-time saliency prediction.
arXiv Detail & Related papers (2020-08-30T17:47:16Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.