A Real-time Action Representation with Temporal Encoding and Deep
Compression
- URL: http://arxiv.org/abs/2006.09675v1
- Date: Wed, 17 Jun 2020 06:30:43 GMT
- Title: A Real-time Action Representation with Temporal Encoding and Deep
Compression
- Authors: Kun Liu, Wu Liu, Huadong Ma, Mingkui Tan, Chuang Gan
- Abstract summary: We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while maintaining a high processing speed.
On the UCF101 action recognition benchmark, our method outperforms state-of-the-art real-time methods by 5.4% in accuracy and runs twice as fast at inference, with a model smaller than 5 MB.
- Score: 115.3739774920845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks have achieved remarkable success for video-based action
recognition. However, most existing approaches cannot be deployed in
practice due to the high computational cost. To address this challenge, we
propose a new real-time convolutional architecture, called Temporal
Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video
action representations in a hierarchical multi-granularity manner while
maintaining a high processing speed. Specifically, we propose a residual 3D
Convolutional Neural Network (CNN) to capture complementary information on the
appearance of a single frame and the motion between consecutive frames. Based
on this CNN, we develop a new temporal encoding method to explore the temporal
dynamics of the whole video. Furthermore, we integrate deep compression
techniques with T-C3D to further accelerate the deployment of models via
reducing the size of the model. These techniques avoid heavy computation at
inference time, enabling the method to process videos at beyond-real-time speed
while maintaining promising accuracy. On the UCF101 action recognition
benchmark, our method outperforms state-of-the-art real-time methods by 5.4%
in accuracy and runs twice as fast at inference, with a model smaller than
5 MB. We
validate our approach by studying its action representation performance on four
different benchmarks over three different tasks. Extensive experiments
demonstrate recognition performance comparable to that of state-of-the-art methods.
The source code and the pre-trained models are publicly available at
https://github.com/tc3d.
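To make the temporal-encoding idea concrete, here is a minimal PyTorch sketch, assuming segment-based clip sampling and torchvision's residual 3D CNN `r3d_18` as a stand-in for the paper's backbone; the class name, fusion choice, and shapes are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torchvision

# Sketch of T-C3D-style temporal encoding (illustrative, not the authors'
# code): split a video into segments, encode one short clip per segment
# with a residual 3D CNN, then fuse the per-clip predictions into a single
# video-level score.

class TemporalEncodingNet(nn.Module):
    def __init__(self, num_classes: int = 101, num_segments: int = 3):
        super().__init__()
        self.num_segments = num_segments
        # torchvision's r3d_18 stands in for the residual 3D CNN backbone.
        self.backbone = torchvision.models.video.r3d_18(num_classes=num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_segments, 3, frames_per_clip, H, W)
        b, s, c, t, h, w = clips.shape
        clip_logits = self.backbone(clips.view(b * s, c, t, h, w))
        # Temporal encoding: aggregate per-clip scores over the whole video;
        # average fusion is one simple, differentiable choice.
        return clip_logits.view(b, s, -1).mean(dim=1)

model = TemporalEncodingNet()
video = torch.randn(2, 3, 3, 16, 112, 112)  # 2 videos, 3 clips of 16 frames
print(model(video).shape)  # torch.Size([2, 101])
```

The deep-compression stage would then be applied to the trained backbone (pruning plus quantization is the usual recipe) to bring the model under the reported 5 MB; the exact compression pipeline is described in the paper, not in this sketch.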
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending beyond $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
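As context for the tri-plane representation mentioned above, here is a hedged sketch of a generic tri-plane feature lookup as popularized by 3D-aware generative models; RAVEN's hybrid explicit-implicit variant is more involved, so every name and shape here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic tri-plane lookup (illustrative): features for a 3D point are
# gathered by projecting it onto three axis-aligned feature planes and
# summing the bilinearly sampled values.

class TriPlane(nn.Module):
    def __init__(self, feat: int = 32, res: int = 64):
        super().__init__()
        self.planes = nn.Parameter(torch.randn(3, feat, res, res) * 0.01)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (n, 3) in [-1, 1]; project onto the xy, xz, and yz planes.
        projections = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
        feats = torch.zeros(1, self.planes.shape[1], xyz.shape[0], 1)
        for plane, uv in zip(self.planes, projections):
            grid = uv.view(1, -1, 1, 2)  # grid_sample expects (N, H, W, 2)
            feats = feats + F.grid_sample(plane[None], grid, align_corners=True)
        return feats.view(feats.shape[1], -1).t()  # (n, feat)

points = torch.rand(512, 3) * 2 - 1
print(TriPlane()(points).shape)  # torch.Size([512, 32])
```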
- F4D: Factorized 4D Convolutional Neural Network for Efficient Video-level Representation Learning [4.123763595394021]
Most existing 3D convolutional neural network (CNN)-based methods for video-level representation learning are clip-based.
We propose a factorized 4D CNN architecture with attention (F4D) that is capable of learning more effective, finer-grained, long-term temporal video representations.
arXiv Detail & Related papers (2023-11-28T19:21:57Z)
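The factorization idea lends itself to a short sketch. Below, a dense 4D convolution over (clip, time, height, width) is approximated by a per-clip 3D convolution followed by a 1D convolution along the clip axis; this is an assumption about the general approach, not F4D's exact design.

```python
import torch
import torch.nn as nn

# Illustrative factorized 4D convolution: a full 4D kernel over
# (clip, time, H, W) is replaced by a cheaper pair of operations,
# a 3D conv within each clip and a 1D conv across clips.

class Factorized4DConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatiotemporal = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.across_clips = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, clips, time, H, W)
        b, c, u, t, h, w = x.shape
        # 3D conv applied to each clip independently (fold clips into batch).
        y = self.spatiotemporal(x.permute(0, 2, 1, 3, 4, 5).reshape(b * u, c, t, h, w))
        y = y.view(b, u, c, t, h, w)
        # 1D conv along the clip axis (fold the remaining dims into batch).
        z = y.permute(0, 3, 4, 5, 2, 1).reshape(b * t * h * w, c, u)
        z = self.across_clips(z)
        return z.view(b, t, h, w, c, u).permute(0, 4, 5, 1, 2, 3)

x = torch.randn(1, 8, 4, 8, 16, 16)  # batch, channels, clips, time, H, W
print(Factorized4DConv(8)(x).shape)  # same shape as the input
```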
- SpATr: MoCap 3D Human Action Recognition based on Spiral Auto-encoder and Transformer Network [1.4732811715354455]
We introduce a novel approach for 3D human action recognition, denoted as SpATr (Spiral Auto-encoder and Transformer Network).
A lightweight auto-encoder, based on spiral convolutions, is employed to extract spatial geometrical features from each 3D mesh.
The proposed method is evaluated on three prominent 3D human action datasets: Babel, MoVi, and BMLrub.
arXiv Detail & Related papers (2023-06-30T11:49:00Z)
- Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, raising PSNR from $34.07$ to $34.57$.
arXiv Detail & Related papers (2022-10-13T08:15:08Z)
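A toy version of the learnable-positional-feature idea, under the assumption of a single low-resolution latent grid (NVP's actual design uses more elaborate multi-resolution features): coordinates index into a learnable grid via trilinear interpolation, and a small MLP decodes the sampled feature to RGB.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy neural video representation with a learnable positional grid
# (illustrative; NVP itself is considerably more sophisticated).

class TinyNVP(nn.Module):
    def __init__(self, feat: int = 16, grid=(8, 32, 32)):
        super().__init__()
        # Learnable latent grid over (time, y, x); this grid plus the MLP
        # is what stores the video after fitting.
        self.grid = nn.Parameter(torch.randn(1, feat, *grid) * 0.01)
        self.mlp = nn.Sequential(nn.Linear(feat, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (n, 3) in [-1, 1], ordered (x, y, t) as grid_sample expects.
        pts = coords.view(1, -1, 1, 1, 3)
        feats = F.grid_sample(self.grid, pts, align_corners=True)  # (1, feat, n, 1, 1)
        return self.mlp(feats.view(self.grid.shape[1], -1).t())    # (n, 3) RGB

model = TinyNVP()
rgb = model(torch.rand(1024, 3) * 2 - 1)
print(rgb.shape)  # torch.Size([1024, 3])
```

Fitting amounts to regressing the video's pixels from their coordinates; the grid and MLP weights then serve as the encoded video.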
- In Defense of Image Pre-Training for Spatiotemporal Recognition [32.56468478601864]
The key to effectively leveraging image pre-training lies in the decomposition of learning spatial and temporal features.
The new pipeline consistently achieves better results on video recognition with significant speedup.
arXiv Detail & Related papers (2022-05-03T18:45:44Z)
- Learning from Temporal Gradient for Semi-supervised Action Recognition [15.45239134477737]
We introduce temporal gradient as an additional modality for more attentive feature extraction.
Our method achieves state-of-the-art performance on three video action recognition benchmarks.
arXiv Detail & Related papers (2021-11-25T20:30:30Z)
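The temporal-gradient modality has a particularly compact realization if one assumes simple forward frame differencing (the paper may use a different operator):

```python
import torch

# Minimal sketch of a "temporal gradient" modality via frame differencing:
# consecutive-frame differences highlight motion and suppress static
# appearance, complementing the RGB stream.

def temporal_gradient(frames: torch.Tensor) -> torch.Tensor:
    """frames: (batch, time, 3, H, W) -> (batch, time-1, 3, H, W)."""
    return frames[:, 1:] - frames[:, :-1]

clip = torch.rand(2, 16, 3, 112, 112)
print(temporal_gradient(clip).shape)  # torch.Size([2, 15, 3, 112, 112])
```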
- MoViNets: Mobile Video Networks for Efficient Video Recognition [52.49314494202433]
3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets.
We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs.
arXiv Detail & Related papers (2021-03-21T23:06:38Z)
- 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition [84.697097472401]
We introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network.
We demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets.
arXiv Detail & Related papers (2020-12-29T21:40:38Z)
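A sketch of the conditional-computation pattern, with the caveat that the policy architecture and gating below are assumptions rather than Ada3D's actual design: a lightweight network inspects a cheap per-frame summary and decides which frames the expensive 3D model should process.

```python
import torch
import torch.nn as nn

# Instance-conditional frame selection (illustrative): a tiny policy
# scores each frame from a cheap pooled descriptor; only the selected
# frames would be forwarded to the heavy 3D CNN.

class FramePolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # cheap per-frame descriptor
        self.score = nn.Linear(3, 1)         # keep/drop logit per frame

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, 3, H, W) -> boolean keep-mask (batch, time)
        b, t, c, h, w = video.shape
        desc = self.pool(video.view(b * t, c, h, w)).view(b, t, c)
        return self.score(desc).squeeze(-1) > 0.0

video = torch.randn(2, 8, 3, 112, 112)
mask = FramePolicy()(video)
print(mask.shape, mask.dtype)  # torch.Size([2, 8]) torch.bool
```

In training, such discrete decisions are typically made differentiable with a relaxation such as Gumbel-Softmax, so the policy can be learned jointly with the recognizer.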
- V4D: 4D Convolutional Neural Networks for Video-level Representation Learning [58.548331848942865]
Most 3D CNNs for video representation learning are clip-based, and thus do not consider the video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representations with 4D convolutions.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
arXiv Detail & Related papers (2020-02-18T09:27:41Z)
- An Information-rich Sampling Technique over Spatio-Temporal CNN for Classification of Human Actions in Videos [5.414308305392762]
We propose a novel scheme for human action recognition in videos, using a 3-dimensional Convolutional Neural Network (3D CNN) based classifier.
In this paper, a 3D CNN architecture is proposed to extract weighted features, followed by a Long Short-Term Memory (LSTM) network to recognize human actions.
Experiments on the KTH and WEIZMANN human action datasets show results comparable to state-of-the-art techniques.
arXiv Detail & Related papers (2020-02-06T05:07:41Z)
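The 3D-CNN-plus-LSTM pipeline described in this last entry reduces to a compact sketch; all layer sizes below are illustrative assumptions (grayscale input suits KTH/WEIZMANN):

```python
import torch
import torch.nn as nn

# Minimal 3D-CNN + LSTM action recognizer (illustrative sizes): short 3D
# convolutions extract spatio-temporal features, and an LSTM aggregates
# them over time before classification.

class C3DLSTM(nn.Module):
    def __init__(self, num_classes: int = 6, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep time
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.classify = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 1, time, H, W), grayscale frames
        f = self.features(clip)                        # (batch, 32, time, 1, 1)
        f = f.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, time, 32)
        out, _ = self.lstm(f)
        return self.classify(out[:, -1])               # last hidden state

model = C3DLSTM()
print(model(torch.randn(2, 1, 16, 60, 80)).shape)  # torch.Size([2, 6])
```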