AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition
- URL: http://arxiv.org/abs/2209.13465v1
- Date: Tue, 27 Sep 2022 15:30:52 GMT
- Title: AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition
- Authors: Yulin Wang, Yang Yue, Xinhong Xu, Ali Hassani, Victor Kulikov, Nikita
Orlov, Shiji Song, Humphrey Shi, Gao Huang
- Abstract summary: This paper explores the unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm.
AdaFocusV3 can be effectively trained by approximating the non-differentiable cropping operation with the interpolation of deep features.
- Score: 44.10959567844497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has revealed that reducing temporal redundancy and reducing
spatial redundancy are both effective approaches towards efficient video recognition,
e.g., allocating the majority of computation to a task-relevant subset of
frames or to the most valuable image regions of each frame. However, in most
existing works, either type of redundancy is typically modeled with the other
absent. This paper explores the unified formulation of spatial-temporal dynamic
computation on top of the recently proposed AdaFocusV2 algorithm, contributing
to an improved AdaFocusV3 framework. Our method reduces the computational cost
by activating the expensive high-capacity network only on some small but
informative 3D video cubes. These cubes are cropped from the space formed by
frame height, width, and video duration, while their locations are adaptively
determined with a lightweight policy network on a per-sample basis. At test
time, the number of cubes processed for each video is dynamically
configured, i.e., video cubes are processed sequentially until a sufficiently
reliable prediction is produced. Notably, AdaFocusV3 can be effectively trained
by approximating the non-differentiable cropping operation with the
interpolation of deep features. Extensive empirical results on six benchmark
datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2
and Diving48) demonstrate that our model is considerably more efficient than
competitive baselines.
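To make the two mechanisms above concrete, the following PyTorch-style sketch illustrates (a) a differentiable approximation of 3D cube cropping via trilinear interpolation, so gradients can reach the policy network that predicts cube locations, and (b) sequential cube processing with an early exit once the prediction is sufficiently confident. It is a minimal illustration, not the authors' implementation; all module sizes, the cube extent, and the confidence-threshold exit rule are assumptions.

```python
# Minimal, illustrative sketch of the AdaFocusV3 ideas described above.
# Assumptions (not from the paper): all network sizes, the cube extent,
# and the confidence-threshold exit rule.
import torch
import torch.nn as nn
import torch.nn.functional as F


def crop_cube_differentiable(video, center, cube_size):
    """Crop a 3D cube around a normalized (t, y, x) center in [-1, 1]^3.

    The crop is realized with trilinear interpolation (F.grid_sample), so it
    stays differentiable w.r.t. the cube center predicted by the policy net.
    video: (N, C, T, H, W); center: (N, 3); cube_size: (T', H', W').
    """
    n = video.size(0)
    t, h, w = cube_size
    dt = torch.linspace(-0.2, 0.2, t, device=video.device)   # temporal half-extent (assumed)
    dy = torch.linspace(-0.3, 0.3, h, device=video.device)   # spatial half-extent (assumed)
    dx = torch.linspace(-0.3, 0.3, w, device=video.device)
    gt, gy, gx = torch.meshgrid(dt, dy, dx, indexing="ij")
    offsets = torch.stack((gx, gy, gt), dim=-1)               # grid_sample wants (x, y, t) order
    center_xyt = center[:, [2, 1, 0]].view(n, 1, 1, 1, 3)
    grid = (offsets.unsqueeze(0) + center_xyt).clamp(-1, 1)   # (N, T', H', W', 3)
    return F.grid_sample(video, grid, align_corners=True)


class TinyAdaFocusV3(nn.Module):
    """Cheap glance network -> policy network -> expensive local network."""

    def __init__(self, num_classes=200, num_cubes=4):
        super().__init__()
        self.num_cubes = num_cubes
        self.glance = nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1)   # cheap global features
        self.policy = nn.Linear(16, 3 * num_cubes)                           # (t, y, x) per cube
        self.local = nn.Sequential(                                          # stand-in for the
            nn.Conv3d(3, 32, 3, stride=2, padding=1), nn.ReLU(),             # high-capacity network
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, video, threshold=0.8):
        feat = self.glance(video).mean(dim=(2, 3, 4))                        # (N, 16)
        centers = torch.tanh(self.policy(feat)).view(-1, self.num_cubes, 3)
        logits_sum = 0.0
        for k in range(self.num_cubes):                                      # process cubes sequentially
            cube = crop_cube_differentiable(video, centers[:, k], (8, 96, 96))
            logits_sum = logits_sum + self.local(cube)
            probs = F.softmax(logits_sum / (k + 1), dim=1)
            # Early exit at test time once every sample is sufficiently confident.
            if (not self.training) and probs.max(dim=1).values.min() >= threshold:
                break
        return logits_sum / (k + 1)


if __name__ == "__main__":
    model = TinyAdaFocusV3().eval()
    clip = torch.rand(2, 3, 16, 224, 224)                                    # (N, C, T, H, W)
    with torch.no_grad():
        print(model(clip, threshold=0.8).shape)                              # torch.Size([2, 200])
```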
Related papers
- EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition [0.0]
We present an efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos.
Specifically, we adapt X3D networks for both the RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences.
Our model provides a 6.2-9.9x reduction in FLOPs (floating-point operations, counted as multiply-adds) and a 9-9.6x reduction in the number of network parameters.
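As a rough illustration of the two-stream, pose-guided design sketched in this summary, the snippet below fuses an RGB stream and a pose-heatmap stream with a simple attention map; the placeholder 3D CNNs (standing in for X3D), the 17-joint heatmap input, and the fusion scheme are assumptions rather than EPAM-Net's actual architecture.

```python
# Rough sketch of a pose-driven, attention-guided two-stream model.
# Placeholder 3D CNNs stand in for the X3D backbones; the attention and
# fusion choices here are illustrative assumptions, not EPAM-Net itself.
import torch
import torch.nn as nn


def tiny_3d_backbone(in_channels, width=32):
    """A small stand-in for an X3D-style spatio-temporal backbone."""
    return nn.Sequential(
        nn.Conv3d(in_channels, width, kernel_size=3, stride=(1, 2, 2), padding=1),
        nn.ReLU(),
        nn.Conv3d(width, width, kernel_size=3, stride=(2, 2, 2), padding=1),
        nn.ReLU(),
    )


class TwoStreamPoseAttention(nn.Module):
    def __init__(self, num_classes=60, width=32):
        super().__init__()
        self.rgb_stream = tiny_3d_backbone(3, width)        # RGB clip
        self.pose_stream = tiny_3d_backbone(17, width)      # per-joint heatmap volumes (assumed 17 joints)
        self.attn = nn.Conv3d(width, 1, kernel_size=1)      # pose-driven spatial attention
        self.head = nn.Linear(2 * width, num_classes)

    def forward(self, rgb, pose_heatmaps):
        f_rgb = self.rgb_stream(rgb)                         # (N, W, T', H', W')
        f_pose = self.pose_stream(pose_heatmaps)
        # Use the pose features to highlight action-relevant RGB locations.
        attn = torch.sigmoid(self.attn(f_pose))
        f_rgb = f_rgb * attn
        pooled = torch.cat([f_rgb.mean(dim=(2, 3, 4)), f_pose.mean(dim=(2, 3, 4))], dim=1)
        return self.head(pooled)


if __name__ == "__main__":
    model = TwoStreamPoseAttention()
    rgb = torch.rand(2, 3, 16, 112, 112)
    pose = torch.rand(2, 17, 16, 112, 112)                   # heatmaps rendered at the same resolution
    print(model(rgb, pose).shape)                            # torch.Size([2, 60])
```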
arXiv Detail & Related papers (2024-08-10T03:15:24Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) and Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
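To illustrate the clustering step in spirit, here is a generic density-peaks key-frame selector over per-frame feature vectors (following the classic Rodriguez-Laio formulation). It is not the paper's TSDPC; the distance cutoff, the rho*delta selection rule, and the synthetic features in the demo are assumptions.

```python
# Generic density-peaks key-frame selection, sketched after the idea above.
# This is NOT the paper's TSDPC; the cutoff heuristic and the selection rule
# (rho * delta above a quantile) are illustrative assumptions.
import numpy as np


def density_peaks_keyframes(features, cutoff=None, gamma_quantile=0.95):
    """Pick key frames as density peaks of per-frame feature vectors.

    features: (num_frames, feat_dim) array, e.g. CNN embeddings per frame.
    Returns the indices of selected key frames (count chosen automatically).
    """
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    if cutoff is None:
        cutoff = np.percentile(dist[dist > 0], 20)           # heuristic cutoff distance
    # Local density: number of frames closer than the cutoff (excluding self).
    rho = (dist < cutoff).sum(axis=1).astype(float) - 1.0
    # Delta: distance to the nearest frame with a higher density.
    delta = np.zeros(len(features))
    for i in range(len(features)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i].max() if higher.size == 0 else dist[i, higher].min()
    # Frames with both high density and high delta are cluster centers,
    # i.e. candidate key frames; their count is not fixed in advance.
    gamma = rho * delta
    return np.where(gamma > np.quantile(gamma, gamma_quantile))[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake per-frame embeddings: three "shots" with small within-shot variation.
    shots = [rng.normal(loc=c, scale=0.05, size=(40, 64)) for c in (0.0, 1.0, 2.0)]
    feats = np.concatenate(shots, axis=0)
    print("key frame indices:", density_peaks_keyframes(feats))
```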
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
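A minimal sketch of the "select informative points, then classify them as a point cloud" idea follows: the top-k locations of a spatio-temporal feature map (scored here simply by feature norm) are gathered together with their normalized (t, y, x) coordinates and passed to a PointNet-style head. The scoring rule, layer sizes, and head are assumptions, not AK-Net's actual design.

```python
# Minimal sketch of the "keypoints as a point cloud" idea summarized above.
# The selection score (L2 norm of features), the PointNet-style head, and
# all sizes are illustrative assumptions, not the AK-Net architecture.
import torch
import torch.nn as nn


class KeypointCloudClassifier(nn.Module):
    def __init__(self, num_classes=174, channels=64, num_points=128):
        super().__init__()
        self.num_points = num_points
        self.backbone = nn.Sequential(                       # cheap spatio-temporal feature extractor
            nn.Conv3d(3, channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.point_mlp = nn.Sequential(                      # shared per-point MLP (PointNet-style)
            nn.Linear(channels + 3, 128), nn.ReLU(), nn.Linear(128, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, video):
        feat = self.backbone(video)                          # (N, C, T', H', W')
        n, c, t, h, w = feat.shape
        flat = feat.flatten(2)                               # (N, C, T'*H'*W')
        score = flat.norm(dim=1)                             # informativeness per location (assumed)
        idx = score.topk(self.num_points, dim=1).indices     # (N, K) selected action keypoints
        points = flat.gather(2, idx.unsqueeze(1).expand(-1, c, -1)).transpose(1, 2)  # (N, K, C)
        # Recover normalized (t, y, x) coordinates of each selected location.
        tt = (idx // (h * w)).float() / max(t - 1, 1)
        yy = ((idx // w) % h).float() / max(h - 1, 1)
        xx = (idx % w).float() / max(w - 1, 1)
        coords = torch.stack((tt, yy, xx), dim=-1)           # (N, K, 3)
        per_point = self.point_mlp(torch.cat((points, coords), dim=-1))
        return self.head(per_point.max(dim=1).values)        # permutation-invariant pooling


if __name__ == "__main__":
    model = KeypointCloudClassifier()
    clip = torch.rand(2, 3, 8, 112, 112)
    print(model(clip).shape)                                 # torch.Size([2, 174])
```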
arXiv Detail & Related papers (2022-01-17T09:35:34Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of the computation can be done in parallel, which is efficient on modern GPU devices.
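The glance-then-focus pipeline in this summary can be sketched as follows: a cheap ConvNet scans downsampled frames, a recurrent policy network emits one patch location per frame, and the cropped high-resolution patches are then processed by the expensive network as a single batch (the parallel offline-inference mode mentioned above). The reinforcement-learning training is omitted and every component is a simplified stand-in.

```python
# Compact sketch of the glance-then-focus pipeline summarized above
# (simplified stand-ins; the reinforcement-learning training is omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlanceAndFocus(nn.Module):
    def __init__(self, num_classes=200, patch=96):
        super().__init__()
        self.patch = patch
        self.glance = nn.Sequential(nn.Conv2d(3, 16, 3, stride=4, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten())    # cheap per-frame features
        self.policy = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
        self.loc = nn.Linear(32, 2)                                           # (y, x) patch center in [-1, 1]
        self.focus = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(64, num_classes))                # expensive network (stand-in)

    def forward(self, video):                                                 # video: (N, T, 3, H, W)
        n, t = video.shape[:2]
        frames = video.flatten(0, 1)                                          # (N*T, 3, H, W)
        coarse = self.glance(F.interpolate(frames, size=96))                  # glance at low resolution
        hidden, _ = self.policy(coarse.view(n, t, -1))                        # recurrent policy over time
        centers = torch.tanh(self.loc(hidden)).flatten(0, 1)                  # (N*T, 2) patch centers (y, x)
        # Build affine sampling grids that crop one high-res patch per frame.
        scale = torch.full_like(centers[:, :1], self.patch / video.size(-1))
        zeros = torch.zeros_like(scale)
        theta = torch.stack(
            (torch.cat((scale, zeros, centers[:, 1:2]), dim=1),               # x-row: [s, 0, tx]
             torch.cat((zeros, scale, centers[:, 0:1]), dim=1)),              # y-row: [0, s, ty]
            dim=1)
        grid = F.affine_grid(theta, (n * t, 3, self.patch, self.patch), align_corners=False)
        patches = F.grid_sample(frames, grid, align_corners=False)
        # Offline inference: all patches are known up front, so the expensive
        # network processes them as one big batch (i.e., in parallel on a GPU).
        logits = self.focus(patches).view(n, t, -1)
        return logits.mean(dim=1)                                             # average predictions over frames


if __name__ == "__main__":
    model = GlanceAndFocus()
    clip = torch.rand(2, 8, 3, 224, 224)
    print(model(clip).shape)                                                  # torch.Size([2, 200])
```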
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
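As an example of such a statistical summary serving as a pretext label, the snippet below locates the grid block with the largest motion (using frame differences as a crude stand-in for optical flow) and quantizes its dominant direction into 8 bins. The grid size, the motion proxy, and the direction binning are illustrative assumptions, not the paper's exact statistics.

```python
# Sketch of one spatio-temporal statistical summary used as a pretext label:
# the grid block with the largest motion and its dominant direction.
# Frame differences stand in for optical flow; grid size and the 8-direction
# binning are illustrative assumptions.
import numpy as np


def largest_motion_summary(frames, grid=4):
    """frames: (T, H, W) grayscale video. Returns (block_index, direction_bin)."""
    t, h, w = frames.shape
    diff = np.abs(np.diff(frames, axis=0))                        # (T-1, H, W) motion magnitude proxy
    bh, bw = h // grid, w // grid
    energy = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            energy[i, j] = diff[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].sum()
    bi, bj = np.unravel_index(energy.argmax(), energy.shape)      # location of the largest motion
    # Dominant direction: angle of the spatial gradient of the temporal change
    # inside that block, quantized into 8 bins (a crude stand-in for flow).
    block = diff[:, bi * bh:(bi + 1) * bh, bj * bw:(bj + 1) * bw].mean(axis=0)
    gy, gx = np.gradient(block)
    angle = np.arctan2(gy.sum(), gx.sum())                        # radians in (-pi, pi]
    direction_bin = int((angle + np.pi) / (2 * np.pi) * 8) % 8
    return bi * grid + bj, direction_bin


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((8, 64, 64))
    video[:, 32:48, 16:32] += np.linspace(0, 2, 8)[:, None, None]  # synthetic moving brightness
    print(largest_motion_summary(video))                           # e.g. (block_id, direction_bin)
```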
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method improves on state-of-the-art real-time methods on the UCF101 action recognition benchmark by 5.4% in accuracy while running twice as fast at inference, with a model of less than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)