Revisiting 3D ResNets for Video Recognition
- URL: http://arxiv.org/abs/2109.01696v1
- Date: Fri, 3 Sep 2021 18:27:52 GMT
- Title: Revisiting 3D ResNets for Video Recognition
- Authors: Xianzhi Du, Yeqing Li, Yin Cui, Rui Qian, Jing Li, Irwan Bello
- Abstract summary: This note studies effective training and scaling strategies for video recognition models.
We propose a simple scaling strategy for 3D ResNets, in combination with improved training strategies and minor architectural changes.
- Score: 18.91688307058961
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A recent work from Bello shows that training and scaling strategies may be
more significant than model architectures for visual recognition. This short
note studies effective training and scaling strategies for video recognition
models. We propose a simple scaling strategy for 3D ResNets, in combination
with improved training strategies and minor architectural changes. The
resulting models, termed 3D ResNet-RS, attain competitive performance of 81.0
on Kinetics-400 and 83.8 on Kinetics-600 without pre-training. When pre-trained
on a large Web Video Text dataset, our best model achieves 83.5 and 84.3 on
Kinetics-400 and Kinetics-600. The proposed scaling rule is further evaluated
in a self-supervised setup using contrastive learning, demonstrating improved
performance. Code is available at:
https://github.com/tensorflow/models/tree/master/official.
Related papers
- ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders [104.05133094625137]
We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
arXiv Detail & Related papers (2023-01-02T18:59:31Z) - Revisiting Classifier: Transferring Vision-Language Models for Video
Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize the well-pretrained language model to generate good semantic target for efficient transferring learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - Training Efficient CNNS: Tweaking the Nuts and Bolts of Neural Networks
for Lighter, Faster and Robust Models [0.0]
We demonstrate how an efficient deep convolution network can be built in a phased manner by sequentially reducing the number of training parameters.
We achieved a SOTA accuracy of 99.2% on MNIST data with just 1500 parameters and an accuracy of 86.01% with just over 140K parameters on the CIFAR-10 dataset.
arXiv Detail & Related papers (2022-05-23T13:51:06Z) - Optimization Planning for 3D ConvNets [123.43419144051703]
It is not trivial to optimally learn a 3D Convolutional Neural Networks (3D ConvNets) due to high complexity and various options of the training scheme.
We decompose the path into a series of training "states" and specify the hyper- parameters, e.g., learning rate and the length of input clips, in each state.
We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., optimization path.
arXiv Detail & Related papers (2022-01-11T16:13:31Z) - Learning Compositional Shape Priors for Few-Shot 3D Reconstruction [36.40776735291117]
We show that complex encoder-decoder architectures exploit large amounts of per-category data.
We propose three ways to learn a class-specific global shape prior, directly from data.
Experiments on the popular ShapeNet dataset show that our method outperforms a zero-shot baseline by over 40%.
arXiv Detail & Related papers (2021-06-11T14:55:49Z) - Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z) - Revisiting ResNets: Improved Training and Scaling Strategies [54.0162571976267]
Training and scaling strategies may matter more than architectural changes, and the resulting ResNets match recent state-of-the-art models.
We show that the best performing scaling strategy depends on the training regime.
We design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.
arXiv Detail & Related papers (2021-03-13T00:18:19Z) - Self-Supervised Pretraining of 3D Features on any Point-Cloud [40.26575888582241]
We present a simple self-supervised pertaining method that can work with any 3D data without 3D registration.
We evaluate our models on 9 benchmarks for object detection, semantic segmentation, and object classification, where they achieve state-of-the-art results and can outperform supervised pretraining.
arXiv Detail & Related papers (2021-01-07T18:55:21Z) - Omni-sourced Webly-supervised Learning for Video Recognition [74.3637061856504]
We introduce OmniSource, a framework for leveraging web data to train video recognition models.
Experiments show that by utilizing data from multiple sources and formats, OmniSource is more data-efficient in training.
arXiv Detail & Related papers (2020-03-29T14:47:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.