In Defense of Image Pre-Training for Spatiotemporal Recognition
- URL: http://arxiv.org/abs/2205.01721v1
- Date: Tue, 3 May 2022 18:45:44 GMT
- Title: In Defense of Image Pre-Training for Spatiotemporal Recognition
- Authors: Xianhang Li, Huiyu Wang, Chen Wei, Jieru Mei, Alan Yuille, Yuyin Zhou,
and Cihang Xie
- Abstract summary: Key to effectively leveraging image pre-training lies in the decomposition of learning spatial and temporal features.
New pipeline consistently achieves better results on video recognition with significant speedup.
- Score: 32.56468478601864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image pre-training, the current de-facto paradigm for a wide range of visual
tasks, is generally less favored in the field of video recognition. By
contrast, a common strategy is to directly train with spatiotemporal
convolutional neural networks (CNNs) from scratch. Nonetheless, interestingly,
by taking a closer look at these from-scratch learned CNNs, we note there exist
certain 3D kernels that exhibit much stronger appearance modeling ability than
others, arguably suggesting appearance information is already well disentangled
in learning. Inspired by this observation, we hypothesize that the key to
effectively leveraging image pre-training lies in the decomposition of learning
spatial and temporal features, and revisiting image pre-training as the
appearance prior for initializing 3D kernels. In addition, we propose
Spatial-Temporal Separable (STS) convolution, which explicitly splits the
feature channels into spatial and temporal groups, to further enable a more
thorough decomposition of spatiotemporal features for fine-tuning 3D CNNs. Our
experiments show that simply replacing 3D convolution with STS notably improves
a wide range of 3D CNNs without increasing parameters and computation on both
Kinetics-400 and Something-Something V2. Moreover, this new training pipeline
consistently achieves better results on video recognition with significant
speedup. For instance, we improve the top-1 accuracy of SlowFast on Kinetics-400 by 0.6% over
the strong 256-epoch 128-GPU baseline while fine-tuning for only 50 epochs with
4 GPUs. The code and models are available at
https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
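To make the two ingredients above concrete, here is a minimal PyTorch sketch of (i) an STS-style convolution that splits feature channels into a spatial group and a temporal group, and (ii) I3D-style inflation of image-pre-trained 2D weights as an appearance prior for 3D kernels. The channel-split ratio, the kernel shapes (1x3x3 spatial, 3x1x1 temporal), and the names STSConv3d and inflate_2d_to_3d are illustrative assumptions, not the authors' implementation; refer to the repository above for the actual code.
```python
# Hypothetical sketch of a Spatial-Temporal Separable (STS)-style convolution.
# Channel split, kernel shapes, and initialization are assumptions for
# illustration only; the released code at
# https://github.com/UCSC-VLAA/Image-Pretraining-for-Video is authoritative.
import torch
import torch.nn as nn


class STSConv3d(nn.Module):
    """Splits input channels into a 'spatial' group and a 'temporal' group,
    convolves each with its own kernel, and concatenates the results."""

    def __init__(self, in_channels, out_channels, spatial_ratio=0.5):
        super().__init__()
        c_sp_in = int(in_channels * spatial_ratio)
        c_tp_in = in_channels - c_sp_in
        c_sp_out = int(out_channels * spatial_ratio)
        c_tp_out = out_channels - c_sp_out
        self.split = [c_sp_in, c_tp_in]
        # Spatial group: 1x3x3 kernel, appearance modeling only.
        self.spatial = nn.Conv3d(c_sp_in, c_sp_out, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        # Temporal group: 3x1x1 kernel, temporal modeling only (assumed shape).
        self.temporal = nn.Conv3d(c_tp_in, c_tp_out, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)

    def forward(self, x):  # x: (N, C, T, H, W)
        x_sp, x_tp = torch.split(x, self.split, dim=1)
        return torch.cat([self.spatial(x_sp), self.temporal(x_tp)], dim=1)


def inflate_2d_to_3d(weight_2d, t_size):
    """I3D-style inflation: reuse image-pre-trained 2D weights as an appearance
    prior for a 3D kernel by repeating them over time and rescaling."""
    # weight_2d: (C_out, C_in, k, k) -> (C_out, C_in, t_size, k, k)
    return weight_2d.unsqueeze(2).repeat(1, 1, t_size, 1, 1) / t_size


if __name__ == "__main__":
    x = torch.randn(2, 64, 8, 56, 56)   # (batch, channels, time, H, W)
    out = STSConv3d(64, 64)(x)
    print(out.shape)                     # torch.Size([2, 64, 8, 56, 56])
```
Under this reading, an ImageNet-pre-trained 2D backbone would seed the appearance (spatial) kernels via inflation, after which the full 3D network is fine-tuned for far fewer epochs than training from scratch.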
Related papers
- Splatter Image: Ultra-Fast Single-View 3D Reconstruction [67.96212093828179]
Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images.
We learn a neural network that, at test time, performs reconstruction in a feed-forward manner, at 38 FPS.
On several synthetic, real, multi-category and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works.
arXiv Detail & Related papers (2023-12-20T16:14:58Z)
- Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition [25.364148451584356]
3D convolutional neural networks (CNNs) have been the prevailing option for video recognition.
We propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach.
Experiments on Something-Something V1&V2 and Kinetics-400 demonstrate that the E3D family achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-03-05T15:11:53Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and Semantic KITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
arXiv Detail & Related papers (2020-11-17T14:34:05Z)
- Spatiotemporal Contrastive Video Representation Learning [87.56145031149869]
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space.
We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial.
arXiv Detail & Related papers (2020-08-09T19:58:45Z)
- Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human Action Recognition [42.400429835080416]
Conventional 3D convolutional neural networks (CNNs) are computationally expensive, memory intensive, prone to overfitting, and, most importantly, in need of better feature learning capabilities.
We propose a new class of convolutional blocks that can serve as an alternative to 3D convolutional layers and their variants in 3D CNNs.
Our evaluation on seven action recognition datasets, including Something-Something v1 and v2, Jester, Diving-48, Kinetics-400, UCF 101, and HMDB 51, demonstrates that STFT-block-based 3D CNNs achieve on par or even better performance compared to the state-of-the-art.
arXiv Detail & Related papers (2020-07-22T12:26:04Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while maintaining a high processing speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model that requires less than 5MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
- V4D: 4D Convolutional Neural Networks for Video-level Representation Learning [58.548331848942865]
Most 3D CNNs for video representation learning are clip-based, and thus do not consider the video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representation with 4D convolutions.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
arXiv Detail & Related papers (2020-02-18T09:27:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.