Optimization Planning for 3D ConvNets
- URL: http://arxiv.org/abs/2201.04021v1
- Date: Tue, 11 Jan 2022 16:13:31 GMT
- Title: Optimization Planning for 3D ConvNets
- Authors: Zhaofan Qiu and Ting Yao and Chong-Wah Ngo and Tao Mei
- Abstract summary: It is not trivial to optimally learn 3D Convolutional Neural Networks (3D ConvNets) due to the high complexity and the many options of the training scheme.
We decompose the path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state.
We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., optimization path.
- Score: 123.43419144051703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is not trivial to optimally learn 3D Convolutional Neural
Networks (3D ConvNets) due to the high complexity and the many options of the
training scheme. The most common hand-tuning process starts by learning 3D
ConvNets on short video clips and is then followed by learning long-term
temporal dependency on lengthy clips, while gradually decaying the learning
rate from high to low as training progresses. The fact that such a process
comes with several heuristic settings motivates the study of seeking an
optimal "path" to automate the entire training. In this paper, we decompose the path into a
series of training "states" and specify the hyper-parameters, e.g., learning
rate and the length of input clips, in each state. The estimation of the knee
point on the performance-epoch curve triggers the transition from one state to
another. We perform dynamic programming over all the candidate states to plan
the optimal permutation of states, i.e., optimization path. Furthermore, we
devise a new 3D ConvNets with a unique design of dual-head classifier to
improve spatial and temporal discrimination. Extensive experiments on seven
public video recognition benchmarks demonstrate the advantages of our proposal.
With the optimization planning, our 3D ConvNets achieves superior results
compared to state-of-the-art recognition methods. More remarkably, we
obtain the top-1 accuracy of 80.5% and 82.7% on Kinetics-400 and Kinetics-600
datasets, respectively. Source code is available at
https://github.com/ZhaofanQiu/Optimization-Planning-for-3D-ConvNets.
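The planning idea above — candidate states defined by hyper-parameters, knee-point-triggered transitions, and dynamic programming over the order of states — can be sketched as a bitmask dynamic program. Everything below is a hypothetical illustration: the candidate states, and the `gain()` surrogate standing in for the knee-point-based performance estimate, are made up for this sketch and are not the authors' implementation.

```python
from itertools import product

# Hypothetical candidate training states: (learning rate, clip length).
# In the paper each state fixes such hyper-parameters; these values are
# invented for illustration.
STATES = [(lr, clips) for lr, clips in product([0.1, 0.01], [16, 32])]

def gain(prev_acc, state):
    """Toy surrogate for the accuracy gain of training in `state`, which
    in the paper would be measured by training until the knee point of
    the performance-epoch curve. Purely illustrative."""
    lr, clips = state
    return (1.0 - prev_acc) * (0.3 + 0.02 * clips / 32 - 0.5 * lr)

def plan_optimal_path(states):
    """Bitmask dynamic programming over permutations of states:
    dp[(mask, i)] = best accuracy after visiting the states in `mask`,
    ending with state i."""
    n = len(states)
    dp = {(1 << i, i): gain(0.0, states[i]) for i in range(n)}
    parent = {}
    for mask in range(1, 1 << n):
        for i in range(n):
            if (mask, i) not in dp:
                continue
            for j in range(n):
                if mask & (1 << j):
                    continue  # state j already used on this path
                acc = dp[(mask, i)]
                cand = acc + gain(acc, states[j])
                key = (mask | (1 << j), j)
                if cand > dp.get(key, -1.0):
                    dp[key] = cand
                    parent[key] = (mask, i)
    full = (1 << n) - 1
    end = max(range(n), key=lambda i: dp[(full, i)])
    best_acc = dp[(full, end)]
    # Walk the parent pointers back to recover the planned path.
    path, key = [], (full, end)
    while key in parent:
        path.append(states[key[1]])
        key = parent[key]
    path.append(states[key[1]])
    return path[::-1], best_acc
```

With a real gain estimate (one training run per transition), this exhaustive DP stays tractable because the number of candidate states is small; the exponential factor is in the number of states, not the number of epochs.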
Related papers
- 3D-CSL: self-supervised 3D context similarity learning for
Near-Duplicate Video Retrieval [17.69904571043164]
We introduce 3D-CSL, a compact pipeline for Near-Duplicate Video Retrieval (NDVR).
We propose a two-stage self-supervised similarity learning strategy to optimize the network.
Our method achieves the state-of-the-art performance on clip-level NDVR.
arXiv Detail & Related papers (2022-11-10T05:51:08Z)
- DBQ-SSD: Dynamic Ball Query for Efficient 3D Object Detection [113.5418064456229]
We propose a Dynamic Ball Query (DBQ) network to adaptively select a subset of input points according to the input features.
It can be embedded into some state-of-the-art 3D detectors and trained in an end-to-end manner, which significantly reduces the computational cost.
arXiv Detail & Related papers (2022-07-22T07:08:42Z)
- 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition [84.697097472401]
We introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network.
We demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets.
arXiv Detail & Related papers (2020-12-29T21:40:38Z)
- Human Body Model Fitting by Learned Gradient Descent [48.79414884222403]
We propose a novel algorithm for the fitting of 3D human shape to images.
We show that this algorithm is fast (120ms average convergence), robust across datasets, and achieves state-of-the-art results on public evaluation datasets.
arXiv Detail & Related papers (2020-08-19T14:26:47Z)
- Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations [15.659251804042748]
Woodpecker-DL (WPK) is a hardware-aware deep learning framework.
WPK uses graph optimization, automated searches, a domain-specific language (DSL) and system-level exploration to accelerate inference.
We show that on a P100 GPU, we can achieve a maximum speedup of 5.40 over cuDNN and 1.63 over TVM on individual operators, and run up to 1.18 times faster than TensorRT for end-to-end model inference.
arXiv Detail & Related papers (2020-08-11T07:50:34Z)
- Spatiotemporal Contrastive Video Representation Learning [87.56145031149869]
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space.
We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial.
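A loss of the kind described in this entry — two augmented clips of the same video pulled together, clips of different videos pushed apart — is commonly an InfoNCE-style objective. The sketch below is a minimal, framework-free illustration under that assumption; the function names and the temperature value are hypothetical, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss: for each video i, the embedding of its first
    augmented clip z1[i] should be closer to z2[i] (the second clip of
    the same video) than to any z2[j] from other videos in the batch."""
    n = len(z1)
    loss = 0.0
    for i in range(n):
        logits = [cosine(z1[i], z2[j]) / temperature for j in range(n)]
        # Numerically stable log-softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / n
```

The loss is near zero when matching clips align and grows when a clip is closer to another video's clip than to its own positive, which is exactly the batch-level pull/push behavior the summary describes.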
arXiv Detail & Related papers (2020-08-09T19:58:45Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements on the UCF101 action recognition benchmark against state-of-the-art real-time methods: 5.4% higher accuracy and 2 times faster inference, with a model of less than 5MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
- Directional Deep Embedding and Appearance Learning for Fast Video Object Segmentation [11.10636117512819]
We propose a directional deep embedding and appearance learning (DEmbed) method, which is free of the online fine-tuning process.
Our method achieves a state-of-the-art VOS performance without using online fine-tuning.
arXiv Detail & Related papers (2020-02-17T01:51:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.