AR-Net: Adaptive Frame Resolution for Efficient Action Recognition
- URL: http://arxiv.org/abs/2007.15796v1
- Date: Fri, 31 Jul 2020 01:36:04 GMT
- Title: AR-Net: Adaptive Frame Resolution for Efficient Action Recognition
- Authors: Yue Meng, Chung-Ching Lin, Rameswar Panda, Prasanna Sattigeri, Leonid
Karlinsky, Aude Oliva, Kate Saenko, and Rogerio Feris
- Abstract summary: Action recognition is an open and challenging problem in computer vision.
We propose a novel approach, called AR-Net, that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition.
- Score: 70.62587948892633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action recognition is an open and challenging problem in computer vision.
While current state-of-the-art models offer excellent recognition results,
their computational expense limits their impact for many real-world
applications. In this paper, we propose a novel approach, called AR-Net
(Adaptive Resolution Network), that selects on-the-fly the optimal resolution
for each frame conditioned on the input for efficient action recognition in
long untrimmed videos. Specifically, given a video frame, a policy network is
used to decide what input resolution should be used for processing by the
action recognition model, with the goal of improving both accuracy and
efficiency. We efficiently train the policy network jointly with the
recognition model using standard back-propagation. Extensive experiments on
several challenging action recognition benchmark datasets demonstrate the
efficacy of our proposed approach over state-of-the-art methods. The project
page can be found at https://mengyuest.github.io/AR-Net
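To make the mechanism described above concrete, here is a minimal PyTorch-style sketch (not the authors' released code) of per-frame resolution selection: a lightweight policy network looks at a cheap view of each frame and picks one of a few candidate resolutions, and a straight-through Gumbel-Softmax sample keeps the discrete choice differentiable so the policy and the recognizer can be trained jointly with standard back-propagation. The candidate resolutions, the tiny policy architecture, and the Gumbel-Softmax trick are assumptions for illustration; the paper's actual design may differ.

# Minimal sketch of per-frame resolution selection (illustrative assumptions,
# not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

RESOLUTIONS = [224, 168, 112, 84]  # hypothetical candidate input sizes

class ResolutionPolicy(nn.Module):
    """Tiny CNN that scores the candidate resolutions for one frame."""
    def __init__(self, num_choices=len(RESOLUTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, num_choices)

    def forward(self, frame, tau=1.0):
        # The policy only ever sees a cheap 84x84 view of the frame.
        logits = self.head(self.features(F.interpolate(frame, size=84)))
        # One-hot sample; gradients flow through via straight-through Gumbel-Softmax.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

def recognize_frame(frame, policy, backbones):
    """Route one frame (B, 3, H, W) to the backbone matching the chosen resolution."""
    choice = policy(frame)                                   # (B, num_choices)
    logits = 0.0
    for i, res in enumerate(RESOLUTIONS):
        x = F.interpolate(frame, size=res, mode="bilinear", align_corners=False)
        # Weight each branch by its one-hot selection so the sum equals the pick.
        logits = logits + choice[:, i:i + 1] * backbones[i](x)
    return logits

# Usage sketch: one small classifier per resolution, 10 hypothetical action classes.
backbones = nn.ModuleList(
    nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
    for _ in RESOLUTIONS)
out = recognize_frame(torch.randn(2, 3, 224, 224), ResolutionPolicy(), backbones)

At inference only the selected branch would actually be evaluated; the weighted sum is simply the standard trick for keeping the discrete decision trainable during joint optimization.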
Related papers
- Rethinking Resolution in the Context of Efficient Video Recognition [49.957690643214576]
Cross-resolution KD (ResKD) is a simple but effective method to boost recognition accuracy on low-resolution frames.
We extensively demonstrate its effectiveness over state-of-the-art architectures, i.e., 3D-CNNs and Video Transformers.
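As a rough illustration of what a cross-resolution distillation objective can look like, the sketch below pairs a teacher that keeps the full-resolution frames with a student that only sees downscaled ones; the temperature, loss weights, and low-resolution size are assumptions, not the exact ResKD recipe.

# Generic cross-resolution knowledge-distillation loss (illustrative sketch).
import torch
import torch.nn.functional as F

def cross_resolution_kd_loss(student, teacher, frames_hr, labels,
                             low_res=112, tau=4.0, alpha=0.7):
    # Student works on cheap low-resolution frames, teacher on the originals.
    frames_lr = F.interpolate(frames_hr, size=low_res, mode="bilinear",
                              align_corners=False)
    with torch.no_grad():
        t_logits = teacher(frames_hr)
    s_logits = student(frames_lr)
    # Softened-logit KL term transfers the teacher's fine-grained predictions.
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                  F.softmax(t_logits / tau, dim=1),
                  reduction="batchmean") * tau * tau
    ce = F.cross_entropy(s_logits, labels)
    return alpha * kd + (1 - alpha) * ce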
arXiv Detail & Related papers (2022-09-26T15:50:44Z)
- Efficient Human Vision Inspired Action Recognition using Adaptive Spatiotemporal Sampling [13.427887784558168]
We introduce a novel adaptive vision system for efficient action recognition.
Our system pre-scans the global context at low resolution and decides to skip or request high-resolution features at salient regions for further processing.
We validate the system on the EPIC-KITCHENS and UCF-101 datasets for action recognition, and show that our proposed approach can greatly speed up inference with a tolerable loss of accuracy compared with state-of-the-art baselines.
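A hedged sketch of the pre-scan-then-refine idea, assuming a low-resolution coarse pass, a saliency map that flags where a high-resolution look is worthwhile, and a simple late fusion; the paper's actual sampling and fusion strategy is more involved.

# Illustrative pre-scan-then-refine pipeline for a single frame.
import torch
import torch.nn.functional as F

def prescan_then_refine(frame, coarse_net, fine_net, saliency_net,
                        low_res=96, crop=112, threshold=0.5):
    """frame: (1, 3, H, W). Run a cheap low-resolution pass; only request a
    high-resolution crop around the most salient location when needed."""
    lr = F.interpolate(frame, size=low_res, mode="bilinear", align_corners=False)
    coarse_logits = coarse_net(lr)
    sal = saliency_net(lr)                # assumed (1, 1, h, w), values in [0, 1]
    if sal.max().item() < threshold:
        return coarse_logits              # nothing salient: skip the high-res pass
    # Map the saliency argmax back to full-resolution coordinates and crop there.
    _, _, h, w = sal.shape
    idx = sal.flatten().argmax().item()
    cy, cx = idx // w, idx % w
    H, W = frame.shape[-2:]
    y = min(max(int(cy / h * H) - crop // 2, 0), H - crop)
    x = min(max(int(cx / w * W) - crop // 2, 0), W - crop)
    patch = frame[..., y:y + crop, x:x + crop]
    return coarse_logits + fine_net(patch)   # simple late fusion of both passes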
arXiv Detail & Related papers (2022-07-12T01:18:58Z)
- FasterVideo: Efficient Online Joint Object Detection And Tracking [0.8680676599607126]
We re-think one of the most successful methods for image object detection, Faster R-CNN, and extend it to the video domain.
Our proposed method achieves the high computational efficiency necessary for relevant applications.
arXiv Detail & Related papers (2022-04-15T09:25:34Z)
- Dynamic Network Quantization for Efficient Video Inference [60.109250720206425]
We propose a dynamic network quantization framework that selects the optimal precision for each frame conditioned on the input for efficient video recognition.
We train both networks effectively using standard backpropagation with a loss to achieve both competitive performance and resource efficiency.
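The per-frame precision choice can be pictured much like the per-frame resolution choice sketched for AR-Net above. In the sketch below, quantized_nets is assumed to map each candidate bit-width to an already-quantized copy of the recognition network, and Gumbel-Softmax is again only an assumed way of keeping the discrete choice differentiable during joint training.

# Illustrative per-frame precision routing (assumptions, not the paper's code).
import torch.nn as nn
import torch.nn.functional as F

BIT_WIDTHS = [32, 8, 4, 2]   # hypothetical candidate precisions

class PrecisionPolicy(nn.Module):
    """Tiny CNN that picks a bit-width for each input frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, stride=4), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(8, len(BIT_WIDTHS)))

    def forward(self, frame, tau=1.0):
        return F.gumbel_softmax(self.net(frame), tau=tau, hard=True)

def route_by_precision(frame, policy, quantized_nets):
    """Weight each precision branch by its one-hot selection; at inference only
    the selected branch would actually be run."""
    choice = policy(frame)                           # (B, len(BIT_WIDTHS))
    return sum(choice[:, i:i + 1] * quantized_nets[b](frame)
               for i, b in enumerate(BIT_WIDTHS))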
arXiv Detail & Related papers (2021-08-23T20:23:57Z)
- AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition [61.51188561808917]
We propose an adaptive multi-modal learning framework, called AdaMML, that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition.
We show that our proposed approach yields 35%-55% reduction in computation when compared to the traditional baseline.
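One way to picture how such computation savings arise is a per-segment on/off gate for each modality with an explicit compute penalty in the loss; the modality set, relative costs, and gating mechanism below are assumptions for illustration, not the AdaMML implementation.

# Illustrative per-segment modality gating with a compute penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ["rgb", "audio", "flow"]            # hypothetical modality set
COSTS = torch.tensor([1.0, 0.1, 0.8])            # assumed relative compute costs

class ModalityGate(nn.Module):
    """Independent on/off decision per modality for one video segment."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.logits = nn.Linear(feat_dim, 2 * len(MODALITIES))

    def forward(self, cheap_feat, tau=1.0):
        logits = self.logits(cheap_feat).view(-1, len(MODALITIES), 2)
        gates = F.gumbel_softmax(logits, tau=tau, hard=True)   # (B, M, 2) one-hot
        return gates[..., 1]                                   # 1 means "use this modality"

def gated_fusion_loss(gates, per_modality_logits, labels, lam=0.1):
    """Average the logits of the selected modalities and penalize total compute."""
    stacked = torch.stack(per_modality_logits, dim=1)          # (B, M, num_classes)
    used = gates.sum(dim=1, keepdim=True).clamp(min=1.0)
    fused = (gates.unsqueeze(-1) * stacked).sum(dim=1) / used
    return F.cross_entropy(fused, labels) + lam * (gates * COSTS).sum(dim=1).mean()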
arXiv Detail & Related papers (2021-05-11T16:19:07Z)
- Video Face Super-Resolution with Motion-Adaptive Feedback Cell [90.73821618795512]
Video super-resolution (VSR) methods have recently achieved remarkable success due to the development of deep convolutional neural networks (CNNs).
In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC), a simple but effective block, which can efficiently capture the motion compensation and feed it back to the network in an adaptive way.
arXiv Detail & Related papers (2020-02-15T13:14:10Z)
- Dynamic Inference: A New Approach Toward Efficient Video Action Recognition [69.9658249941149]
Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost.
We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos.
arXiv Detail & Related papers (2020-02-09T11:09:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.