Motion Representation Using Residual Frames with 3D CNN
- URL: http://arxiv.org/abs/2006.13017v1
- Date: Sun, 21 Jun 2020 07:35:41 GMT
- Title: Motion Representation Using Residual Frames with 3D CNN
- Authors: Li Tao, Xueting Wang, Toshihiko Yamasaki
- Abstract summary: We propose a fast but effective way to extract motion features from videos utilizing residual frames as the input data in 3D ConvNets.
By replacing traditional stacked RGB frames with residual ones, top-1 accuracy improvements of 35.6 and 26.6 percentage points can be obtained.
- Score: 43.002621928500425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, 3D convolutional networks (3D ConvNets) have yielded good
performance in action recognition. However, an optical flow stream is still
needed to ensure better performance, and its computational cost is very high.
In this paper, we propose a fast but effective way to extract motion features
from videos by utilizing residual frames as the input data in 3D ConvNets. By
replacing traditional stacked RGB frames with residual ones, top-1 accuracy
improvements of 35.6 and 26.6 percentage points can be obtained on the UCF101
and HMDB51 datasets when ResNet-18 models are trained from scratch, achieving
state-of-the-art results in this training mode. Analysis shows that better
motion features can be extracted using residual frames than with their RGB
counterparts. By combining with a simple appearance path, our proposal can even
outperform some methods using optical flow streams.
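The core idea of residual-frame input is simple: subtract consecutive RGB frames so static appearance cancels out and only motion boundaries remain. A minimal NumPy sketch (not the authors' exact preprocessing code; the clip layout and toy example are assumptions for illustration):

```python
import numpy as np

def residual_frames(clip: np.ndarray) -> np.ndarray:
    """Compute residual frames from a stacked RGB clip.

    Each residual frame is the difference between consecutive RGB
    frames, so static appearance largely cancels and motion remains.

    clip: array of shape (T, H, W, 3).
    Returns: array of shape (T-1, H, W, 3), float32.
    """
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]

# Toy example: a bright square moving one pixel to the right.
clip = np.zeros((2, 4, 4, 3), dtype=np.float32)
clip[0, 1:3, 0:2] = 255.0  # frame 0: square at columns 0-1
clip[1, 1:3, 1:3] = 255.0  # frame 1: square at columns 1-2
res = residual_frames(clip)
# Static pixels are zero; only the leading/trailing edges carry signal.
```

The resulting (T-1)-frame stack can then be fed to a 3D ConvNet exactly as stacked RGB frames would be, which is what makes the swap essentially free compared with computing optical flow.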
Related papers
- Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior arts, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, 34.07 → 34.57 (measured with the PSNR metric).
arXiv Detail & Related papers (2022-10-13T08:15:08Z)
- Neural Residual Flow Fields for Efficient Video Representations [5.904082461511478]
Implicit neural representation (INR) has emerged as a powerful paradigm for representing signals, such as images, videos, 3D shapes, etc.
We propose a novel INR approach to representing and compressing videos by explicitly removing data redundancy.
We show that the proposed method outperforms the baseline methods by a significant margin.
arXiv Detail & Related papers (2022-01-12T06:22:09Z)
- MoViNets: Mobile Video Networks for Efficient Video Recognition [52.49314494202433]
3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets.
We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs.
arXiv Detail & Related papers (2021-03-21T23:06:38Z)
- 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
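The summary above describes grouping similar temporal features to reduce temporal resolution where a video is redundant. A hedged toy sketch of that idea (not the paper's actual SGS module; the threshold, cosine-similarity criterion, and greedy adjacent grouping are simplifying assumptions):

```python
import numpy as np

def group_similar_frames(feats: np.ndarray, thresh: float = 0.95) -> np.ndarray:
    """Merge adjacent temporal feature vectors whose cosine similarity
    exceeds `thresh`, shrinking temporal resolution in redundant spans.

    feats: (T, C) per-timestep feature vectors.
    Returns: (T', C) averaged group features, with T' <= T.
    """
    groups = [[feats[0]]]
    for f in feats[1:]:
        prev = np.mean(groups[-1], axis=0)
        cos = np.dot(prev, f) / (np.linalg.norm(prev) * np.linalg.norm(f) + 1e-8)
        if cos > thresh:
            groups[-1].append(f)  # redundant: merge into current group
        else:
            groups.append([f])    # distinct: start a new temporal group
    return np.stack([np.mean(g, axis=0) for g in groups])

# Two nearly identical timesteps followed by a distinct one collapse to 2 groups.
feats = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
reduced = group_similar_frames(feats)
```

Since the 3D CNN then operates on fewer temporal positions, compute falls roughly in proportion to the grouping, which is the intuition behind the GFLOPs reduction claimed above.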
arXiv Detail & Related papers (2020-11-17T14:34:05Z)
- Challenge report: VIPriors Action Recognition Challenge [14.080142383692417]
Action recognition has attracted much research attention because of its wide range of applications, but it remains challenging.
In this paper, we study previous methods and propose our method.
We use a fast but effective way to extract motion features from videos by using residual frames as input.
arXiv Detail & Related papers (2020-07-16T08:40:31Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2 times faster inference, with a model requiring less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
- Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition [43.002621928500425]
We propose a fast but effective way to extract motion features from videos utilizing residual frames as the input data in 3D ConvNets.
By replacing traditional stacked RGB frames with residual ones, top-1 accuracy improvements of 20.5 and 12.5 percentage points can be achieved.
Because residual frames contain little information about object appearance, we further use a 2D convolutional network to extract appearance features.
arXiv Detail & Related papers (2020-01-16T05:49:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.