AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video
Recognition
- URL: http://arxiv.org/abs/2112.14238v1
- Date: Tue, 28 Dec 2021 17:53:38 GMT
- Title: AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video
Recognition
- Authors: Yulin Wang, Yang Yue, Yuanze Lin, Haojun Jiang, Zihang Lai, Victor
Kulikov, Nikita Orlov, Humphrey Shi, Gao Huang
- Abstract summary: This work reformulates the training of AdaFocus as a simple one-stage algorithm.
We present an improved training scheme to address the issues introduced by the one-stage formulation.
Our model significantly outperforms the original AdaFocus and other competitive baselines.
- Score: 23.12743642910384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown that the computational efficiency of video
recognition can be significantly improved by reducing the spatial redundancy.
As a representative work, the adaptive focus method (AdaFocus) has achieved a
favorable trade-off between accuracy and inference speed by dynamically
identifying and attending to the informative regions in each video frame.
However, AdaFocus requires a complicated three-stage training pipeline
(involving reinforcement learning), which leads to slow convergence and is
unfriendly to practitioners. This work reformulates the training of AdaFocus as
a simple one-stage algorithm by introducing a differentiable
interpolation-based patch selection operation, enabling efficient end-to-end
optimization. We further present an improved training scheme to address the
issues introduced by the one-stage formulation, including the lack of
supervision, limited input diversity and reduced training stability. Moreover, a
conditional-exit technique is proposed to perform temporal adaptive computation
on top of AdaFocus without additional training. Extensive experiments on six
benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics,
Something-Something V1&V2, and Jester) demonstrate that our model significantly
outperforms the original AdaFocus and other competitive baselines, while being
considerably simpler and more efficient to train. Code is available at
https://github.com/LeapLabTHU/AdaFocusV2.
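The two mechanisms named above lend themselves to a short illustration. The
following is a minimal PyTorch sketch, not the authors' implementation (see the
repository above for that): `crop_patch_differentiable` shows how bilinear
interpolation via `F.grid_sample` makes patch selection differentiable with
respect to a predicted patch centre, and `predict_with_conditional_exit` shows
a confidence-threshold conditional exit over frames. The patch size, the 0.9
threshold and all function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def crop_patch_differentiable(frames, centers, patch_size=96):
    """Crop a patch around a continuous centre via bilinear interpolation.

    frames:  (B, C, H, W) input frames
    centers: (B, 2) patch centres in [-1, 1] normalised (x, y) coordinates
    Gradients flow back to `centers` through F.grid_sample.
    """
    B, C, H, W = frames.shape
    # Half-extent of a patch_size-pixel patch in normalised coordinates.
    half_h, half_w = patch_size / H, patch_size / W
    ys = torch.linspace(-half_h, half_h, patch_size, device=frames.device)
    xs = torch.linspace(-half_w, half_w, patch_size, device=frames.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)                 # (P, P, 2), (x, y) order
    grid = grid.unsqueeze(0) + centers.view(B, 1, 1, 2)  # shift by each centre
    return F.grid_sample(frames, grid, align_corners=False)

@torch.no_grad()
def predict_with_conditional_exit(frames, model, threshold=0.9):
    """Stop processing frames once the running prediction is confident.

    frames: (1, T, C, H, W); batch size 1 for clarity (real code would exit
    per sample). `model` maps a frame batch to class logits.
    """
    logits_sum = 0.0
    for t in range(frames.size(1)):
        logits_sum = logits_sum + model(frames[:, t])
        probs = torch.softmax(logits_sum / (t + 1), dim=-1)
        if probs.max() >= threshold:   # confident enough: skip remaining frames
            break
    return probs
```

Because `F.grid_sample` is differentiable with respect to its sampling grid,
the task loss reaches whichever network predicts `centers`, which is what lets
the whole pipeline be optimized end-to-end without reinforcement learning.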
Related papers
- Efficient Reinforcement Learning Through Adaptively Pretrained Visual Encoder [12.310140622800372]
We propose APE: efficient reinforcement learning through an Adaptively Pretrained visual Encoder.
APE uses an adaptive augmentation strategy during pretraining and extracts generalizable features with only a few interactions with the task environments during policy learning.
Results show that mainstream RL methods, such as DreamerV3 and DrQ-v2, achieve state-of-the-art performance when equipped with APE.
arXiv Detail & Related papers (2025-02-08T12:57:02Z)
- Uni-AdaFocus: Spatial-temporal Dynamic Computation for Video Recognition [82.75714185083383]
This paper investigates the phenomenon of data redundancy in video understanding, with the aim to improve computational efficiency.
Motivated by this phenomenon, we introduce a spatially adaptive video recognition approach, termed AdaFocus.
Our resulting framework, Uni-AdaFocus, seamlessly integrates spatial, temporal, and sample-wise dynamic computation.
arXiv Detail & Related papers (2024-12-15T15:51:44Z)
- Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition [72.35438297011176]
We propose a novel method to realize seamless adaptation of pre-trained models for visual place recognition (VPR).
Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method.
Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time.
arXiv Detail & Related papers (2024-02-22T12:55:01Z)
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z)
- Boost Video Frame Interpolation via Motion Adaptation [73.42573856943923]
Video frame interpolation (VFI) is a challenging task that aims to generate intermediate frames between two consecutive frames in a video.
Existing learning-based VFI methods have achieved great success, but they still suffer from limited generalization ability.
We propose a novel optimization-based VFI method that can adapt to unseen motions at test time.
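As a hedged sketch of the test-time adaptation idea this summary describes, one
can fine-tune a pretrained interpolation model on triplets drawn from the test
video itself, reconstructing an existing middle frame from its two neighbours
as a self-supervised objective. The `vfi_model` interface, the L1 loss and the
step count below are assumptions, not necessarily the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def adapt_to_test_video(vfi_model, video, steps=10, lr=1e-5):
    """video: (T, C, H, W), T >= 3; vfi_model(a, b) -> predicted middle frame."""
    opt = torch.optim.Adam(vfi_model.parameters(), lr=lr)
    for _ in range(steps):
        t = torch.randint(1, video.size(0) - 1, (1,)).item()  # random middle frame
        pred = vfi_model(video[t - 1:t], video[t + 1:t + 2])  # keep batch dim
        loss = F.l1_loss(pred, video[t:t + 1])                # reconstruct frame t
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vfi_model
```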
arXiv Detail & Related papers (2023-06-24T10:44:02Z)
- AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition [44.10959567844497]
This paper explores a unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm.
AdaFocusV3 can be effectively trained by approximating the non-differentiable cropping operation with the computation of deep features.
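One way to read "approximating the non-differentiable cropping operation with
the computation of deep features" is to crop on a feature map that is computed
once per frame, using the same bilinear mechanism as in the pixel-space sketch
under the abstract above. The `stem` module and the 7-cell patch below are
illustrative assumptions; this sketches the idea, not AdaFocusV3 itself.

```python
import torch
import torch.nn.functional as F

def feature_space_crop(frame, center, stem, patch_cells=7):
    """frame: (B, C, H, W); center: (B, 2) in [-1, 1]; stem: any conv backbone."""
    feat = stem(frame)                                # (B, C', h, w), computed once
    B, _, h, w = feat.shape
    span_y, span_x = patch_cells / h, patch_cells / w
    ys = torch.linspace(-span_y, span_y, patch_cells, device=feat.device)
    xs = torch.linspace(-span_x, span_x, patch_cells, device=feat.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0) + center.view(B, 1, 1, 2)
    return F.grid_sample(feat, grid, align_corners=False)  # differentiable crop
```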
arXiv Detail & Related papers (2022-09-27T15:30:52Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
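The glance-then-focus inference pattern described here can be sketched as
follows; every module name is a placeholder, and `crop_patch_differentiable` is
the helper from the sketch under the abstract. The point from the summary is
that once the patch locations exist, the expensive network sees all patches as
a single batch, which is what makes offline inference parallel.

```python
import torch

@torch.no_grad()
def adafocus_offline_inference(video, glance_net, policy_rnn, focus_net, classifier):
    """video: (T, C, H, W), a single clip; all networks are placeholder modules."""
    hidden, centers = None, []
    for t in range(video.size(0)):              # cheap sequential pass over frames
        g = glance_net(video[t:t + 1])          # coarse global features
        center, hidden = policy_rnn(g, hidden)  # (1, 2) patch centre in [-1, 1]
        centers.append(center)
    # Heavy pass: all selected patches go through focus_net as one batch,
    # which parallelises well on modern GPUs.
    patches = torch.cat([crop_patch_differentiable(video[t:t + 1], centers[t])
                         for t in range(video.size(0))])
    feats = focus_net(patches)                           # (T, D)
    return classifier(feats.mean(dim=0, keepdim=True))   # temporal average pool
```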
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- Adaptive Serverless Learning [114.36410688552579]
We propose a novel adaptive decentralized training approach, which can compute the learning rate from data dynamically.
Our theoretical results reveal that the proposed algorithm can achieve linear speedup with respect to the number of workers.
To reduce the communication overhead, we further propose a communication-efficient adaptive decentralized training approach.
arXiv Detail & Related papers (2020-08-24T13:23:02Z)