Rethinking Motion Representation: Residual Frames with 3D ConvNets for
Better Action Recognition
- URL: http://arxiv.org/abs/2001.05661v1
- Date: Thu, 16 Jan 2020 05:49:13 GMT
- Title: Rethinking Motion Representation: Residual Frames with 3D ConvNets for
Better Action Recognition
- Authors: Li Tao, Xueting Wang, Toshihiko Yamasaki
- Abstract summary: We propose a fast but effective way to extract motion features from videos by using residual frames as the input data to 3D ConvNets.
By replacing traditional stacked RGB frames with residual ones, top-1 accuracy improvements of 20.5 and 12.5 percentage points can be achieved.
Because residual frames contain little object-appearance information, we further use a 2D convolutional network to extract appearance features.
- Score: 43.002621928500425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, 3D convolutional networks have yielded good performance in action
recognition. However, an optical flow stream is still needed to ensure better
performance, and computing optical flow is very costly. In this paper, we propose
a fast but effective way to extract motion features from videos by using residual
frames as the input data to 3D ConvNets. By replacing traditional stacked RGB
frames with residual ones, top-1 accuracy improvements of 20.5 and 12.5 percentage
points can be achieved on the UCF101 and HMDB51 datasets when training from
scratch. Because residual frames contain little object-appearance information, we
further use a 2D convolutional network to extract appearance features and combine
them with the results from residual frames to form a two-path solution. On three
benchmark datasets, our two-path solution achieved performance better than or
comparable to methods using additional optical flow, and it notably outperformed
state-of-the-art models on the Mini-Kinetics dataset. Further analysis indicates
that better motion features can be extracted using residual frames with 3D
ConvNets, and that our residual-frame-input path is a good supplement to existing
RGB-frame-input models.
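As a concrete illustration of the abstract, the sketch below shows how residual frames can be computed from a stacked-RGB clip and late-fused with a 2D appearance path. This is a minimal PyTorch sketch under stated assumptions, not the authors' code: the (C, T, H, W) tensor layout, the middle-frame appearance input, and the `alpha` fusion weight are all assumptions, since the paper only states that the two paths' results are combined.

```python
import torch


def residual_frames(clip: torch.Tensor) -> torch.Tensor:
    """Turn a stacked-RGB clip of shape (C, T, H, W) into residual frames.

    The residual clip is the difference between consecutive frames,
    so it has T - 1 frames and suppresses static appearance.
    """
    return clip[:, 1:] - clip[:, :-1]


def two_path_scores(clip: torch.Tensor,
                    motion_net: torch.nn.Module,      # 3D ConvNet on residual frames
                    appearance_net: torch.nn.Module,  # 2D ConvNet on one RGB frame
                    alpha: float = 0.5) -> torch.Tensor:
    """Late-fuse class scores from the motion path and the appearance path.

    `alpha` is a hypothetical fusion weight; the exact fusion rule is an
    assumption, not taken from the paper.
    """
    motion_logits = motion_net(residual_frames(clip).unsqueeze(0))
    # Use the middle frame as the appearance input (an assumption).
    mid = clip.shape[1] // 2
    appearance_logits = appearance_net(clip[:, mid].unsqueeze(0))
    return alpha * motion_logits + (1.0 - alpha) * appearance_logits
```

Computing residuals is a single tensor subtraction per clip, which is why this motion representation is far cheaper than an optical flow stream.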
Related papers
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from fully decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the video and directly matching them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video
Recognition [84.697097472401]
We introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network.
We demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets.
arXiv Detail & Related papers (2020-12-29T21:40:38Z)
- Towards Fast, Accurate and Stable 3D Dense Face Alignment [73.01620081047336]
We propose a novel regression framework named 3DDFA-V2 which strikes a balance among speed, accuracy, and stability.
We present a virtual synthesis method to transform a still image into a short video that incorporates in-plane and out-of-plane face movement.
arXiv Detail & Related papers (2020-09-21T15:37:37Z)
- Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition [10.185425416255294]
We propose to use residual frames as an alternative "lightweight" motion representation.
We also develop a new pseudo-3D convolution module which decouples 3D convolution into a 2D and a 1D convolution (see the sketch after this list).
arXiv Detail & Related papers (2020-08-03T17:40:17Z)
- Motion Representation Using Residual Frames with 3D CNN [43.002621928500425]
We propose a fast but effective way to extract motion features from videos by using residual frames as the input data to 3D ConvNets.
By replacing traditional stacked RGB frames with residual ones, top-1 accuracy improvements of 35.6 and 26.6 percentage points can be obtained.
arXiv Detail & Related papers (2020-06-21T07:35:41Z)
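For the pseudo-3D entry above, the following is a minimal sketch assuming a (2+1)D-style factorization: a k x k x k 3D convolution is replaced by a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution. The module name, the ReLU placement, and the choice of keeping the output channel width between the two stages are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """Decouples a 3D convolution into a 2D spatial convolution followed
    by a 1D temporal convolution, both expressed via nn.Conv3d kernels."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # 1 x k x k convolution over the spatial dimensions only.
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, kernel_size, kernel_size),
                                 padding=(0, pad, pad))
        # k x 1 x 1 convolution over the temporal dimension only.
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(kernel_size, 1, 1),
                                  padding=(pad, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W), the standard 3D-ConvNet input layout.
        return self.relu(self.temporal(self.relu(self.spatial(x))))


# Example: a residual clip of 15 frames at 112 x 112 resolution.
x = torch.randn(1, 3, 15, 112, 112)
y = Pseudo3DConv(3, 64)(x)  # -> torch.Size([1, 64, 15, 112, 112])
```

The factorization cuts the kernel's parameter count from k^3 per channel pair to k^2 + k, which is the usual motivation for calling such modules "lightweight".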