Look More but Care Less in Video Recognition
- URL: http://arxiv.org/abs/2211.09992v1
- Date: Fri, 18 Nov 2022 02:39:56 GMT
- Title: Look More but Care Less in Video Recognition
- Authors: Yitian Zhang, Yue Bai, Huan Wang, Yi Xu, Yun Fu
- Abstract summary: Action recognition methods typically sample only a few frames to represent each video in order to avoid enormous computation, which often limits recognition performance.
We propose the Ample and Focal Network (AFNet), a two-branch design that utilizes more frames with less computation.
- Score: 57.96505328398205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing action recognition methods typically sample a few frames to
represent each video to avoid the enormous computation, which often limits
recognition performance. To tackle this problem, we propose the Ample and Focal
Network (AFNet), which is composed of two branches so as to utilize more frames
with less computation. Specifically, the Ample Branch takes all input frames to
obtain abundant information with condensed computation and provides guidance for
the Focal Branch through the proposed Navigation Module; the Focal Branch
squeezes the temporal size to focus only on the salient frames at each
convolution block; finally, the results of the two branches are adaptively fused
to prevent the loss of information. With this design, we can introduce more
frames to the network at a lower computational cost. Moreover, we demonstrate
that AFNet can utilize fewer frames while achieving higher accuracy, as the
dynamic selection in intermediate features enforces implicit temporal modeling.
Further, we show that our method can be extended to reduce spatial redundancy
at even less cost. Extensive experiments on five datasets demonstrate the
effectiveness and efficiency of our method.
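The abstract names the components (Ample Branch, Focal Branch, Navigation Module, adaptive fusion) without giving implementation details, so the PyTorch sketch below is only a rough illustration of how such a two-branch block could be wired. All class and parameter names here are hypothetical, and the hard top-k frame selection is a non-differentiable stand-in for the paper's learned Navigation Module, which would need a relaxation (e.g. Gumbel-softmax) to train end to end.

```python
# Hypothetical sketch of an AFNet-style two-branch block; names and shapes
# are assumptions, not the authors' actual implementation.
import torch
import torch.nn as nn

class AmpleFocalBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4, keep_frames: int = 4):
        super().__init__()
        self.keep_frames = keep_frames
        # Ample branch: cheap conv over ALL frames (condensed channel width).
        self.ample = nn.Conv2d(channels, channels // reduction, 3, padding=1)
        self.expand = nn.Conv2d(channels // reduction, channels, 1)
        # Navigation: scores frame saliency from pooled ample features.
        self.navigate = nn.Linear(channels // reduction, 1)
        # Focal branch: full-capacity conv, applied to salient frames only.
        self.focal = nn.Conv2d(channels, channels, 3, padding=1)
        # Learned scalar controlling the adaptive fusion of the two branches.
        self.fuse = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape                     # (batch, frames, C, H, W)
        ample = self.ample(x.flatten(0, 1)).view(b, t, -1, h, w)
        scores = self.navigate(ample.mean(dim=(3, 4))).squeeze(-1)  # (b, t)
        k = min(self.keep_frames, t)
        idx = scores.topk(k, dim=1).indices         # hard selection (sketch only)
        focal_in = torch.stack([x[i, idx[i]] for i in range(b)])
        focal = self.focal(focal_in.flatten(0, 1)).view(b, k, c, h, w)
        out = self.expand(ample.flatten(0, 1)).view(b, t, c, h, w)
        gate = torch.sigmoid(self.fuse)             # adaptive fusion weight
        for i in range(b):                          # fuse focal results back in
            out[i, idx[i]] = (1 - gate) * out[i, idx[i]] + gate * focal[i]
        return out
```

For example, a clip tensor of shape (2, 8, 64, 56, 56) passes through with its shape unchanged, while only 4 of the 8 frames receive the expensive focal computation; every frame still contributes through the cheap ample path.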
Related papers
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task that can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z) - ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, suffer from high computational cost, and can't be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we build two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z) - Efficient Flow-Guided Multi-frame De-fencing [7.504789972841539]
De-fencing is the algorithmic process of automatically removing obstructions such as fences from images.
We develop a framework for multi-frame de-fencing that computes high quality flow maps directly from obstructed frames.
arXiv Detail & Related papers (2023-01-25T18:42:59Z) - Alignment-guided Temporal Attention for Video Action Recognition [18.5171795689609]
We show that frame-by-frame alignments have the potential to increase the mutual information between frame representations.
We propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames.
arXiv Detail & Related papers (2022-09-30T23:10:47Z) - NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves a state-of-the-art accuracy-efficiency trade-off and delivers significantly faster (2.4x to 4.3x) practical inference speed than state-of-the-art methods; a generic sketch of this saliency-based frame selection pattern appears after this list.
arXiv Detail & Related papers (2022-07-21T09:41:22Z) - Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform state-of-the-art methods for action detection in public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z) - All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling [52.425236515695914]
State-of-the-art methods are iterative solutions that interpolate one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramid-style network in the temporal domain to complete the multi-frame task in one shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z) - Dynamic Inference: A New Approach Toward Efficient Video Action Recognition [69.9658249941149]
Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost.
We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos.
arXiv Detail & Related papers (2020-02-09T11:09:56Z)
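A recurring mechanism across this list (NSNet's non-saliency suppression, Dynamic Inference's per-video adaptivity, and AFNet's own Navigation Module) is spending compute only on the frames a cheap scorer deems salient. The sketch below is a generic, hypothetical illustration of that shared pattern, not the sampler of any particular paper above; at training time the hard top-k would typically be relaxed (e.g. via Gumbel-softmax).

```python
# Generic saliency-based frame sampling (illustrative only; names are
# hypothetical and do not correspond to any single paper's implementation).
import torch
import torch.nn as nn

class SalientFrameSampler(nn.Module):
    """Score frames with a lightweight policy head, then keep the top-k."""

    def __init__(self, feat_dim: int, keep: int):
        super().__init__()
        self.keep = keep
        self.scorer = nn.Sequential(            # cheap per-frame saliency head
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(),
            nn.Linear(feat_dim // 2, 1),
        )

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, time, feat_dim) pooled per-frame features
        scores = self.scorer(frame_feats).squeeze(-1)          # (batch, time)
        # Hard top-k; sort indices so temporal order is preserved.
        idx = scores.topk(self.keep, dim=1).indices.sort(1).values
        picked = torch.gather(
            frame_feats, 1,
            idx.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1)),
        )
        return picked, idx

# Usage: select 4 salient frames out of 16 before an expensive backbone.
sampler = SalientFrameSampler(feat_dim=256, keep=4)
feats = torch.randn(2, 16, 256)
picked, idx = sampler(feats)
print(picked.shape)  # torch.Size([2, 4, 256])
```

The design choice all of these methods share is to keep the scorer far cheaper than the backbone it gates, so the selection overhead is amortized by the frames it allows the network to skip.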