Rethinking Resolution in the Context of Efficient Video Recognition
- URL: http://arxiv.org/abs/2209.12797v1
- Date: Mon, 26 Sep 2022 15:50:44 GMT
- Title: Rethinking Resolution in the Context of Efficient Video Recognition
- Authors: Chuofan Ma, Qiushan Guo, Yi Jiang, Zehuan Yuan, Ping Luo, Xiaojuan Qi
- Abstract summary: Cross-resolution KD (ResKD) is a simple but effective method to boost recognition accuracy on low-resolution frames.
We extensively demonstrate its effectiveness over state-of-the-art architectures, i.e., 3D-CNNs and Video Transformers.
- Score: 49.957690643214576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we empirically study how to make the most of low-resolution
frames for efficient video recognition. Existing methods mainly focus on
developing compact networks or alleviating temporal redundancy of video inputs
to increase efficiency, whereas compressing frame resolution has rarely been
considered a promising solution. A major concern is the poor recognition
accuracy on low-resolution frames. We thus start by analyzing the underlying
causes of performance degradation on low-resolution frames. Our key finding is
that the major cause of degradation is not information loss in the
down-sampling process, but rather the mismatch between network architecture and
input scale. Motivated by the success of knowledge distillation (KD), we
propose to bridge the gap between network and input size via cross-resolution
KD (ResKD). Our work shows that ResKD is a simple but effective method to boost
recognition accuracy on low-resolution frames. Without bells and whistles,
ResKD considerably surpasses all competitive methods in terms of efficiency and
accuracy on four large-scale benchmark datasets, i.e., ActivityNet, FCVID,
Mini-Kinetics, and Something-Something V2. In addition, we extensively demonstrate
its effectiveness over state-of-the-art architectures, i.e., 3D-CNNs and Video
Transformers, and scalability towards super low-resolution frames. The results
suggest ResKD can serve as a general inference acceleration method for
state-of-the-art video recognition. Our code will be available at
https://github.com/CVMI-Lab/ResKD.
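The cross-resolution KD idea in the abstract — a student network fed low-resolution frames distilled from a teacher fed high-resolution frames — can be sketched as a standard distillation objective. A minimal NumPy illustration, assuming a cross-entropy term plus a temperature-scaled KL term; the temperature `T`, weight `alpha`, and function name `reskd_loss` are illustrative, not values or APIs from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Row-wise softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def reskd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Cross-resolution KD sketch: the student sees low-resolution frames,
    the teacher sees high-resolution frames of the same clip.

    Loss = alpha * CE(student, labels)
         + (1 - alpha) * T^2 * KL(teacher_soft || student_soft)
    """
    # cross-entropy on ground-truth labels (student branch)
    p_s = softmax(student_logits)
    ce = -np.mean(np.log(p_s[np.arange(len(labels)), labels] + 1e-12))
    # temperature-scaled KL divergence to the high-resolution teacher
    p_t = softmax(teacher_logits, T)
    log_ps_T = np.log(softmax(student_logits, T) + 1e-12)
    kd = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - log_ps_T), axis=1)) * (T * T)
    return alpha * ce + (1 - alpha) * kd
```

The `T * T` factor keeps the KD gradient magnitude comparable to the cross-entropy term when the temperature softens the logits, a common convention in distillation losses.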
Related papers
- HRDecoder: High-Resolution Decoder Network for Fundus Image Lesion Segmentation [12.606794661369959]
We propose HRDecoder, a simple High-Resolution Decoder network for fundus lesion segmentation.
It integrates a high-resolution representation learning module to capture fine-grained local features and a high-resolution fusion module to fuse multi-scale predictions.
Our method effectively improves the overall segmentation accuracy of fundus lesions while consuming reasonable memory and computational overhead, and maintaining satisfying inference speed.
arXiv Detail & Related papers (2024-11-06T15:13:31Z)
- Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval [16.497758750494537]
We propose an efficient video representation network with Differentiable Resolution Compression and Alignment mechanism.
We leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features.
We introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions.
arXiv Detail & Related papers (2023-09-15T05:31:53Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework; one advantage over previous works is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on the top of the CNN to further elevate the performance of classification.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves a state-of-the-art accuracy-efficiency trade-off and delivers significantly faster (2.4x to 4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z)
- Fast Online Video Super-Resolution with Deformable Attention Pyramid [172.16491820970646]
Video super-resolution (VSR) has many applications that pose strict causal, real-time, and latency constraints, including video streaming and TV.
We propose a recurrent VSR architecture based on a deformable attention pyramid (DAP).
arXiv Detail & Related papers (2022-02-03T17:49:04Z)
- Super-Resolving Compressed Video in Coding Chain [27.994055823226848]
We present a mixed-resolution coding framework, which cooperates with a reference-based DCNN.
In this novel coding chain, the reference-based DCNN learns the direct mapping from low-resolution (LR) compressed video to their high-resolution (HR) clean version at the decoder side.
arXiv Detail & Related papers (2021-03-26T03:39:54Z)
- AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [70.62587948892633]
Action recognition is an open and challenging problem in computer vision.
We propose a novel approach, called AR-Net, that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition.
arXiv Detail & Related papers (2020-07-31T01:36:04Z)
- Deep Space-Time Video Upsampling Networks [47.62807427163614]
Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems.
We propose an end-to-end framework for the space-time video upsampling by efficiently merging VSR and FI into a joint framework.
Results show improvements both quantitatively and qualitatively, while reducing inference time (7x faster) and the number of parameters (by 30%) compared to baselines.
arXiv Detail & Related papers (2020-04-06T07:04:21Z)
- Video Face Super-Resolution with Motion-Adaptive Feedback Cell [90.73821618795512]
Video super-resolution (VSR) methods have recently achieved remarkable success due to the development of deep convolutional neural networks (CNNs).
In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC), a simple but effective block, which can efficiently capture the motion compensation and feed it back to the network in an adaptive way.
arXiv Detail & Related papers (2020-02-15T13:14:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.