Rethinking Motion Representation: Residual Frames with 3D ConvNets for
Better Action Recognition
- URL: http://arxiv.org/abs/2001.05661v1
- Date: Thu, 16 Jan 2020 05:49:13 GMT
- Title: Rethinking Motion Representation: Residual Frames with 3D ConvNets for
Better Action Recognition
- Authors: Li Tao, Xueting Wang, Toshihiko Yamasaki
- Abstract summary: We propose a fast but effective way to extract motion features from videos by using residual frames as the input data to 3D ConvNets.
By replacing traditional stacked RGB frames with residual ones, top-1 accuracy improvements of 20.5 and 12.5 percentage points can be achieved.
Because residual frames contain little object-appearance information, we further use a 2D convolutional network to extract appearance features.
- Score: 43.002621928500425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, 3D convolutional networks have yielded good performance in action
recognition. However, an optical flow stream is still needed to ensure better
performance, and computing optical flow is very costly. In this paper, we propose
a fast but effective way to extract motion features from videos by using residual
frames as the input data to 3D ConvNets. By replacing traditional stacked RGB
frames with residual ones, top-1 accuracy improvements of 20.5 and 12.5 percentage
points can be achieved on the UCF101 and HMDB51 datasets when training from
scratch. Because residual frames contain little object-appearance information, we
further use a 2D convolutional network to extract appearance features and combine
them with the results from residual frames to form a two-path solution. On three
benchmark datasets, our two-path solution achieved performance better than or
comparable to methods using additional optical flow, and it notably outperformed
state-of-the-art models on the Mini-Kinetics dataset. Further analysis indicates
that better motion features can be extracted using residual frames with 3D
ConvNets, and that our residual-frame-input path is a good supplement to existing
RGB-frame-input models.
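As a concrete illustration of the abstract, the sketch below shows how residual frames can be computed from a stacked-RGB clip and late-fused with a 2D appearance path. This is a minimal PyTorch sketch under stated assumptions, not the authors' code: the (C, T, H, W) tensor layout, the middle-frame appearance input, and the `alpha` fusion weight are all assumptions, since the paper only states that the two paths' results are combined.

```python
import torch


def residual_frames(clip: torch.Tensor) -> torch.Tensor:
    """Turn a stacked-RGB clip of shape (C, T, H, W) into residual frames.

    The residual clip is the difference between consecutive frames,
    so it has T - 1 frames and suppresses static appearance.
    """
    return clip[:, 1:] - clip[:, :-1]


def two_path_scores(clip: torch.Tensor,
                    motion_net: torch.nn.Module,      # 3D ConvNet on residual frames
                    appearance_net: torch.nn.Module,  # 2D ConvNet on one RGB frame
                    alpha: float = 0.5) -> torch.Tensor:
    """Late-fuse class scores from the motion path and the appearance path.

    `alpha` is a hypothetical fusion weight; the exact fusion rule is an
    assumption, not taken from the paper.
    """
    motion_logits = motion_net(residual_frames(clip).unsqueeze(0))
    # Use the middle frame as the appearance input (an assumption).
    mid = clip.shape[1] // 2
    appearance_logits = appearance_net(clip[:, mid].unsqueeze(0))
    return alpha * motion_logits + (1.0 - alpha) * appearance_logits
```

Computing residuals is a single tensor subtraction per clip, which is why this motion representation is far cheaper than an optical flow stream.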
Related papers
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from fully decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the video and directly matching them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video
Recognition [84.697097472401]
We introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network.
We demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets.
arXiv Detail & Related papers (2020-12-29T21:40:38Z)
- Towards Fast, Accurate and Stable 3D Dense Face Alignment [73.01620081047336]
We propose a novel regression framework named 3DDFA-V2 which strikes a balance among speed, accuracy, and stability.
We present a virtual synthesis method to transform a still image into a short video that incorporates in-plane and out-of-plane face movement.
arXiv Detail & Related papers (2020-09-21T15:37:37Z)
- Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition [10.185425416255294]
We propose to use residual frames as an alternative "lightweight" motion representation.
We also develop a new pseudo-3D convolution module which decouples 3D convolution into a 2D and a 1D convolution (see the sketch after this list).
arXiv Detail & Related papers (2020-08-03T17:40:17Z)
- Motion Representation Using Residual Frames with 3D CNN [43.002621928500425]
We propose a fast but effective way to extract motion features from videos by using residual frames as the input data to 3D ConvNets.
By replacing traditional stacked RGB frames with residual ones, top-1 accuracy improvements of 35.6 and 26.6 percentage points can be obtained.
arXiv Detail & Related papers (2020-06-21T07:35:41Z)
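For the pseudo-3D entry above, the following is a minimal sketch assuming a (2+1)D-style factorization: a k x k x k 3D convolution is replaced by a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution. The module name, the ReLU placement, and the choice of keeping the output channel width between the two stages are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """Decouples a 3D convolution into a 2D spatial convolution followed
    by a 1D temporal convolution, both expressed via nn.Conv3d kernels."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # 1 x k x k convolution over the spatial dimensions only.
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, kernel_size, kernel_size),
                                 padding=(0, pad, pad))
        # k x 1 x 1 convolution over the temporal dimension only.
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(kernel_size, 1, 1),
                                  padding=(pad, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W), the standard 3D-ConvNet input layout.
        return self.relu(self.temporal(self.relu(self.spatial(x))))


# Example: a residual clip of 15 frames at 112 x 112 resolution.
x = torch.randn(1, 3, 15, 112, 112)
y = Pseudo3DConv(3, 64)(x)  # -> torch.Size([1, 64, 15, 112, 112])
```

The factorization cuts the kernel's parameter count from k^3 per channel pair to k^2 + k, which is the usual motivation for calling such modules "lightweight".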