Adaptive Focus for Efficient Video Recognition
- URL: http://arxiv.org/abs/2105.03245v1
- Date: Fri, 7 May 2021 13:24:47 GMT
- Title: Adaptive Focus for Efficient Video Recognition
- Authors: Yulin Wang, Zhaoxi Chen, Haojun Jiang, Shiji Song, Yizeng Han, Gao Huang
- Abstract summary: We propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be done in parallel, which is efficient on modern GPU devices.
- Score: 29.615394426035074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the spatial redundancy in video recognition with
the aim to improve the computational efficiency. It is observed that the most
informative region in each frame of a video is usually a small image patch,
which shifts smoothly across frames. Therefore, we model the patch localization
problem as a sequential decision task, and propose a reinforcement learning
based approach for efficient spatially adaptive video recognition (AdaFocus).
Specifically, a lightweight ConvNet is first adopted to quickly process the
full video sequence, and its features are used by a recurrent policy network to
localize the most task-relevant regions. Then the selected patches are inferred
by a high-capacity network for the final prediction. During offline inference,
once the informative patch sequence has been generated, the bulk of the
computation can be done in parallel, which is efficient on modern GPU devices. In addition,
we demonstrate that the proposed method can be easily extended by further
considering the temporal redundancy, e.g., dynamically skipping less valuable
frames. Extensive experiments on five benchmark datasets, i.e., ActivityNet,
FCVID, Mini-Kinetics, Something-Something V1&V2, demonstrate that our method is
significantly more efficient than the competitive baselines. Code will be
available at https://github.com/blackfeather-wang/AdaFocus.
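To make the pipeline above concrete, here is a minimal PyTorch-style sketch of the two-network, policy-driven patch-selection model the abstract describes. All module architectures, layer sizes, and the hard-crop helper are illustrative assumptions, not the authors' implementation; in particular, the paper trains the policy with reinforcement learning (the hard crop is non-differentiable), and that training loop is omitted here.

```python
# A minimal sketch of an AdaFocus-style pipeline, under the assumptions above.
import torch
import torch.nn as nn

class AdaFocusSketch(nn.Module):
    def __init__(self, num_classes=200, patch_size=96, feat_dim=256):
        super().__init__()
        self.patch_size, self.feat_dim = patch_size, feat_dim
        # Lightweight global ConvNet: cheaply scans every full frame.
        self.global_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Recurrent policy network: consumes global features frame by frame
        # and emits a 2-D patch centre (the "action") for each frame.
        self.policy_rnn = nn.GRUCell(feat_dim, feat_dim)
        self.policy_head = nn.Linear(feat_dim, 2)   # (x, y) in [0, 1]
        # High-capacity local ConvNet: only ever sees the small patches.
        self.local_cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, video):                        # video: (B, T, 3, H, W)
        B, T, _, H, W = video.shape
        h = video.new_zeros(B, self.feat_dim)
        patches = []
        for t in range(T):                           # sequential decision task
            g = self.global_cnn(video[:, t])         # cheap global glance
            h = self.policy_rnn(g, h)
            centre = torch.sigmoid(self.policy_head(h))
            patches.append(self._crop(video[:, t], centre))
        # Offline inference: with all patch locations decided, the expensive
        # network processes every patch in one parallel, GPU-friendly batch.
        feats = self.local_cnn(torch.cat(patches))   # (T*B, feat_dim)
        return self.classifier(feats.view(T, B, -1).mean(0))

    def _crop(self, frames, centre):
        # Hard (non-differentiable) crop around each predicted centre;
        # assumes H, W >= patch_size. The paper trains the policy with
        # reinforcement learning precisely because this step has no gradient.
        _, _, H, W = frames.shape
        ps = self.patch_size
        out = []
        for b in range(frames.shape[0]):
            x = int(centre[b, 0] * (W - ps))
            y = int(centre[b, 1] * (H - ps))
            out.append(frames[b:b + 1, :, y:y + ps, x:x + ps])
        return torch.cat(out)
```

The frame-skipping extension mentioned in the abstract would slot into the same loop by letting the policy also emit a skip action; that variant is omitted here.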
Related papers
- SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation.
SparseTem achieves speedups of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z)
- ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, suffer from high computational cost, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we curate two new datasets that emulate real-world video-call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition [44.10959567844497]
This paper explores the unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm.
AdaFocusV3 can be effectively trained by approximating the non-differentiable cropping operation with the computation of deep features; a hedged sketch of one interpolation-based crop approximation appears after this list.
arXiv Detail & Related papers (2022-09-27T15:30:52Z)
- NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves the state-of-the-art accuracy-efficiency trade-off and presents a significantly faster (2.4-4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z)
- Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects informative points scattered across arbitrary-shaped regions as a set of action keypoints and then transforms video recognition into point-cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Scene-Adaptive Video Frame Interpolation via Meta-Learning [54.87696619177496]
We propose to adapt the model to each video by making use of additional information that is readily available at test time.
We obtain significant performance gains with only a single gradient update without any additional parameters.
arXiv Detail & Related papers (2020-04-02T02:46:44Z)
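The non-differentiable cropping mentioned in the AdaFocusV3 entry above is commonly side-stepped with interpolation-based sampling, so that gradients can reach the predicted patch location. Below is a hedged sketch of such a differentiable crop using torch.nn.functional.grid_sample; it operates on raw pixels for simplicity, whereas the paper's approximation works on deep features, and the function name and grid construction are illustrative assumptions.

```python
# A minimal sketch of interpolation-based differentiable cropping,
# under the assumptions stated above.
import torch
import torch.nn.functional as F

def differentiable_crop(frames, centre, patch_size):
    """frames: (B, C, H, W); centre: (B, 2) in [0, 1]; returns (B, C, ps, ps)."""
    B, C, H, W = frames.shape
    # Build a patch_size x patch_size sampling grid around each centre,
    # expressed in grid_sample's normalised [-1, 1] coordinates.
    ys = torch.linspace(-1.0, 1.0, patch_size, device=frames.device)
    xs = torch.linspace(-1.0, 1.0, patch_size, device=frames.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")           # (ps, ps)
    base = torch.stack([gx, gy], dim=-1)                     # (ps, ps, 2), (x, y)
    # Scale the grid to the patch extent and shift it to each centre.
    scale = torch.tensor([patch_size / W, patch_size / H], device=frames.device)
    shift = 2.0 * centre - 1.0                               # map [0, 1] -> [-1, 1]
    grid = base.unsqueeze(0) * scale + shift.view(B, 1, 1, 2)
    # Bilinear sampling is differentiable w.r.t. both frames and grid,
    # so gradients flow back to the predicted patch centre.
    return F.grid_sample(frames, grid, align_corners=False)
```

Because the sampled patch is a smooth function of the centre, the patch localizer can be trained end-to-end with ordinary backpropagation instead of reinforcement learning.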
This list is automatically generated from the titles and abstracts of the papers on this site.